ABSTRACT

This study was undertaken to develop guidelines for 
making interpretive inferences from scores on the Test of English for 
International Communication (TOEIC), a norm-referenced test of 
English-language listening comprehension (LC) and reading (R) skills, 
about level of ability to use English in face-to-face conversation,
indexed by performance in the Language Proficiency Interview (LPI)
situation. LPI performance, rated according to behaviorally defined 
levels on the LPI/ILR/FSI quasi-absolute proficiency scale, was 
treated as a context-independent criterion, using the familiar 
regression model in an apparently novel application (for such 
criterion-referenced purposes) in the context of a large-scale 
ESL-testing program. The study employed TOEIC/LPI data-sets generated 
during operational ESL assessments in representative TOEIC-use 
settings (places of work or work-related ESL training) in Japan, 
France, Mexico, and Saudi Arabia, involving samples of adult,
educated ESL users/learners in or preparing for ESL-essential
positions with companies engaged in international commerce. The 
pattern of TOEIC/LPI concurrent correlations was consistent across 
samples and there was relatively close fit between sample LPI means 
and estimates from TOEIC scores, especially TOEIC-LC, using 
combined-sample regression equations. Theoretical and pragmatic 
implications of the findings are discussed. General guidelines are 
provided for making inferences about LPI-assessed level of oral
English proficiency from TOEIC scores. Directions are suggested for 
further research and development activities in the TOEIC testing 
context. (Author)







ED 395 93 



TOEIC RESEARCH REPORT
NUMBER 1
SEPTEMBER 1989







Enhancing the Interpretation of a Norm-Referenced 
Second-Language Test through Criterion-Referencing:
A Research Assessment of Experience in the 
TOEIC Testing Context 



Kenneth M. Wilson 










Educational Testing Service 
Princeton, New Jersey 



Copyright © 1989 by Educational Testing Service, Princeton, NJ. 

All rights reserved. 

TOEIC, the TOEIC logo, ETS, and the ETS logo are trademarks of
Educational Testing Service, registered in the USA and in many 
other countries. 

Unauthorized reproduction in whole or in part is prohibited. 

Educational Testing Service is an Equal Opportunity/Affirmative
Action Employer. 




Summary



A generic problem with norm-referenced second-language proficiency
tests is that examinees' scores on the tests do not provide a
direct indication of their actual levels of functional ability to use
a target language as demographically comparable native speakers can be
expected to use it. The functional implications of scores on such
tests must be established empirically by conducting criterion-related 
validity studies designed to link level of performance on specific 
tests to level of performance on criterion measures of ability to use 
English (or other target language), operationally defined in some 
acceptable sense, based on direct observation and evaluation of the 
defined behavior. 

The criterion observations may be either "context-specific"
(e.g., samples of business correspondence, observation of communicative
interaction with native speakers), or "context-independent"
(e.g., ability to use English in face-to-face conversation, directly 
assessed using the Language Proficiency Interview [LPI] procedure that 
results in ratings of oral language proficiency according to inherent- 
ly meaningful, behaviorally defined levels). 

The Test of English for International Communication (TOEIC),
developed by Educational Testing Service (ETS), is a multiple-choice,
norm-referenced test designed to measure the English-language listening
comprehension (LC) and reading (R) skills of individuals for
whom English is a second language (ESL). The TOEIC is used primarily
by corporate clients, worldwide; the majority of clients are located
in Japan, as are about 80 percent of all TOEIC examinees. In Japan
and in several other countries, TOEIC affairs are administered by
local representative offices; elsewhere the TOEIC is available through
the TOEIC-ETS (Princeton) office.

This study was undertaken to develop and evaluate guidelines for 
making inferences about level of oral English proficiency from TOEIC 
scores. Level of performance on the TOEIC was referenced to LPI 
ratings, for samples of examinees from representative test-use 
settings in Japan, France, Mexico, and Saudi Arabia, using the 
familiar regression model. 

LPI ratings were regressed on TOEIC-LC, TOEIC-R, and TOEIC-Total
respectively, and on TOEIC-LC and TOEIC-R (as a battery of predictors),
in data-sets obtained under operational conditions. Study data were
obtained during the course of comprehensive ESL proficiency assessments
conducted by TOEIC-trained interviewers/raters in representative
TOEIC-use settings in Japan, and by TOEIC/ETS staff members, including
the staff member responsible for providing training in the LPI 
technique in Japan and elsewhere. 

Across four Japanese subsamples (N = 42 through N = 142, combined
N = 285), coefficients for Total/LPI ranged between .71 and .80,
TOEIC-LC/LPI correlations ranged between .67 and .80, and TOEIC-R/LPI
coefficients were slightly lower (as expected on theoretical
grounds), ranging between .65 and .72. Similar patterns of relationships
were found in data-sets for samples of TOEIC examinees in
France (N = 56), Mexico (N = 42), and Saudi Arabia (N = 10). The
correlational findings indicated that inferences about LPI performance
based solely on TOEIC-LC were essentially as valid as inferences based
on TOEIC-Total, the simple sum of LC and R, or on regression-weighted
composites of LC and R. Results of a residual analysis indicated that
the fit between observed and estimated criterion means was more
consistent across the four national samples, when LPI was estimated
from a combined-sample regression equation using only the TOEIC-LC
score, than from a combined-sample equation using the Total score.
These findings are evaluated from both theoretical and pragmatic 
perspectives. 

Study findings suggest, as a strong working hypothesis, that 
level of ability to use English in face-to-face conversation (indexed 
by LPI performance) will vary relatively consistently with level of 
developed English-language listening comprehension (indexed by TOEIC- 
LC scores), across as well as within samples of educated, academically 
trained ESL users/learners likely to be tested with the TOEIC in 
diverse national TOEIC subpopulations. Guidelines for making inter- 
pretive inferences about levels of oral English proficiency from TOEIC 
scores are developed, and evaluated from theoretical and pragmatic 
perspectives. Attention is called to the problem of relating tested 
levels of proficiency to levels of on-the-job performance in positions
that require the use of English, and to the problem of setting 
"minimum proficiency requirements." 

It is concluded that by its initiative in encouraging and 
facilitating the use of the well-established LPI (direct assessment) 
procedure in operational testing contexts, the TOEIC program has made 
it possible to develop general guidelines that permit test users to 
make statistically valid inferences from TOEIC scores about levels of 
oral English proficiency. Furthermore, this initiative has made it 
possible to develop a better-informed perspective regarding the level
and range of developed oral English proficiency relative to expectation
for an educated native speaker in the population of ESL users/learners
likely to be tested with the TOEIC. These are interpretive
inferences that cannot be drawn from knowledge of distributions of
standard scores on norm-referenced tests, alone.

Section I: INTRODUCTION



A generic problem with norm-referenced tests of second-language
proficiency is that the test scores do not provide any direct 
indication of actual levels of functional ability to use the target 
language(s) involved. As noted by Ingram (1985: 237), for example, 

[norm-referenced tests serve primarily] to discriminate amongst and 
rank-order learners and the learner's proficiency level is measured 
in relation to the performance of other learners, i.e., all one can 
directly say about the results of such tests is that on Test X 
Learner A was better or worse than Learner B or than n% of the other 
learners who took the test. 

The interpretive problem has been succinctly summarized by 
Carroll (1967: 2), as follows: 

Except to the extent that one can guess at the range of competence 
possessed by a reference group, a percentile rank [on norm-refer- 
enced tests] cannot tell, for example, how successful the individ- 
ual would be in communicating with a native speaker of the lan- 
guage or in comprehending the substance of printed materials in the 
language.

Thus, for example, knowledge of an examinee's standard scores or 
percentile ranks on a norm-referenced test such as the Test of English 
as a Foreign Language or TOEFL [ETS, 1985a], permits no direct inferences
regarding the nonnative-speaker's functional ability to use
English as a second language (ESL) or as a foreign language (EFL)--for
example, to engage in a communicative dialogue with native-English-speaking
students or faculty members.

Accordingly, norm-referenced tests of English language macroskills
(e.g., listening or reading), or components of such skills
(e.g., vocabulary), or knowledge of grammar, and so on, are referred
to as indirect measures of "real-life language activities" (e.g., 
Clark, 1975: 10-11). 

The functional implications of scores on such tests must be 
established empirically by conducting criterion-related validity 
studies designed to link level of performance on specific tests to 
level of performance on criterion measures of ability to use the tar- 
get language, operationally defined in some acceptable sense, based on 
direct observation and evaluation of pertinent behavior. As noted by 
Clark (1975, 1978), for example: 

The usefulness (of indirect, norm-referenced tests) does not . . .
depend on the tests' face/content validity but on the extent to
which the test scores are found to correlate, on a statistical
basis, with more direct measures of the proficiency in question
(1978a: 27); (and) . . . the validity of indirect procedures as
measures of real life proficiency is established through
statistical--specifically correlational--means (1975: 11).

Although it seems clear that this is so, surprisingly little 
attention has been given to the exposition, evaluation, and applica- 
tion of models for conducting the types of criterion-related validity 
studies needed to establish the general level and consistency of concurrent
relationships between scores on particular norm-referenced
tests and specified criteria of ability to use English (or any other
target language)--either general "context-independent" language-use
criteria (for example, direct assessments of oral language proficiency
in a controlled interview situation) or diverse "context-specific"
criteria (reflecting observation and evaluation of ability to meet
linguistic demands in various "real-life" work or study contexts).

Thus, there is little direct precedent for the study reported 
herein--a study undertaken to establish interpretive guidelines for
the Test of English for International Communication (TOEIC), by
"calibrating" (a) scores on this norm-referenced ESL proficiency test
to (b) behaviorally defined levels of "functional ability to use 
English in face-to-face conversation" (assessed formally in structured 
conversational interviews), treated as (c) a general, "context- 
independent" criterion variable, in (d) samples of TOEIC examinees 
from representative TOEIC-use settings in Japan and elsewhere. 

For the present it is sufficient to establish the following 
points:

1. The TOEIC is a multiple-choice, norm-referenced test, with sections
measuring English language listening comprehension and reading
ability. The TOEIC testing program, developed and generally administered
by Educational Testing Service (ETS), serves primarily corporate
employers outside the United States who need to make
English-proficiency-related personnel selection, placement, and/or
training decisions (see, for example, ETS, 1982a, 1985b, 1986a, 1986b,
1988).

2. Functional ability to use English in face-to-face conversation 
was assessed using the well-established direct Language Proficiency 
Interview (LPI) procedure developed by the Foreign Service Institute 
(FSI) of the U.S. Department of State. Language Proficiency Interview
is only one of several recognized designations for this direct oral
language proficiency interview procedure, referred to originally as
the Foreign Service Institute (FSI) Oral Proficiency Interview
(OPI). The procedure has also been designated as the Interagency
Language Roundtable (ILR) Oral Interview, reflecting the fact that 
it has been adopted by a number of U.S. governmental agencies, known 
collectively as the Interagency Language Roundtable (see Lowe, 
1987). In the TOEIC testing context, the interview procedure is 
widely known as the LPI procedure. Regardless of the designation
applied, this direct assessment procedure generates ratings of oral
language proficiency according to inherently meaningful (behaviorally
defined) levels on an "absolute proficiency scale" ranging from 0
(no proficiency) through 5 (proficiency equivalent to that of an
educated native speaker [ENS] of the target language).

3. The familiar regression model was employed to calibrate (reference,
link) scores on the arbitrarily defined TOEIC standard score
scale to directly interpretable levels of LPI performance (treated
as a "context-independent" language use criterion measure) in
samples of TOEIC examinees in Japan and elsewhere.

Although it has been infrequently applied, the concept of setting
general functional guidelines for the interpretation of norm-referenced
second-language proficiency tests by calibrating the test scores
to the directly interpretable LPI scale is logical, and it has strong
empirical precedent. In a benchmark study of the attainments of foreign
language majors in the United States, Carroll (1967) used a simple
equating model to establish equivalencies between (a) scores on
norm-referenced tests (of basic macroskills in French, German, Russian,
and Spanish) and (b) LPI-scaled conversational interview ratings
(and comparably scaled ratings of functional reading proficiency in
the target languages), using data for (c) samples generally representative
of the focal populations.



The elemental significance of the concept of enhancing the
interpretation of norm-referenced language proficiency tests by
referencing (calibrating) test scores to inherently meaningful,
behaviorally scaled direct proficiency measures--the basic concept
embodied in Carroll's 1967 study design (the Carroll model)--apparently
has not been generally recognized. In reviewing the research
literature, for example, the writer was unable to find an extended
discussion of the Carroll model, and no directly comparable study
involving norm-referenced second-language proficiency tests appears to
have been conducted in the United States.



In circumstances such as those described, it is important to 
provide general context and perspective for the empirical study con- 
ducted in the TOEIC testing context along lines sketched above, by 

1. considering briefly two complementary approaches to the design
of criterion-related validity studies concerned with enhancing the
interpretation of norm-referenced second-language tests, namely,
studies involving "context-specific" (real-life) language use
criteria, and studies involving general "context-independent" criteria
(such as performance in Language Proficiency Interviews, the criterion
employed in the study);

2. examining properties of the LPI that seem logically to 
establish the relevance of functionally scaled LPI behavior as a
"context-independent" criterion for use in setting interpretive
guidelines for indirect measures;

3. reviewing in some detail the pioneering study by Carroll
(1967), and a large-scale study (Hilton, Grandy, Kline, & Liskin-Gasparro,
1985) in which self-assessments of oral language proficiency
were calibrated to LPI ratings in samples of teachers of Spanish and
French in the U.S.--studies that yielded important interpretive
benefits by referencing scores on indirect measures (both test and
nontest) to behaviorally scaled, direct measures of language proficiency,
using simple equating models; and

4. detailing the advantages of a regression-based approach (over 
the equating approach) to calibrating the arbitrarily defined scales 
of norm-referenced, indirect proficiency measures to directly inter- 
pretable proficiency levels, using LPI-performance as a "context- 
independent" criterion variable. 

A review of these elemental considerations is provided in Section 
II to establish the conceptual and methodological rationale for the 
empirical study in the TOEIC testing context that involved the use of 
a regression-based model for developing and evaluating the usefulness 
of guidelines for inferring (estimating) LPI performance from scores 
on the TOEIC, in samples of educated, adult ESL users/learners from
representative TOEIC-use settings (places of work or work-related 
intensive ESL training) in Japan, France, Mexico, and Saudi Arabia, 
using TOEIC/LPI data-sets generated during the course of comprehens- 
ive, operational on-site ESL assessments. 

Study findings, in samples from the majority test-taking subpopulation
in Japan, and in samples from three additional national test-taking
subpopulations, indicate that clear interpretive benefits were
realized by referencing scores on the TOEIC to LPI performance. The 
findings and other evidence reviewed in the study suggest, as a 
working hypothesis, that the pattern of TOEIC/LPI relationships 
observed in the study sample is likely to be relatively consistent 
across nationally and linguistically diverse samples of educated, 
adult ESL users/learners who are likely to be tested with the TOEIC.

Section II. ALTERNATIVE EMPIRICAL MODELS FOR ENHANCING THE
INTERPRETATION OF NORM-REFERENCED TESTS

For purposes of the present paper it is useful to consider 
briefly the principal characteristics of two complementary types of 
criterion-related validity studies, namely, studies designed to relate 
scores on norm-referenced (indirect) tests to "context-specific" 
criteria of ability to use a target language, and those designed to 
relate the test scores to "context-independent" criteria.



Relating Indirect Tests to Context-Specific Criteria

In studies involving context-specific criteria, the aim is to
relate scores on norm-referenced tests to criteria that reflect functional
ability to use a target language in specific settings (e.g.,
places of work or study) to perform language-essential tasks (e.g.,
discuss business affairs with native speakers or participate in an
academic seminar; write business letters or term papers). Performance
might be assessed by native-speaking supervisors, colleagues, or
clients. Studies of this type have been characterized as "ultimate 
pragmatic validity studies" (Ingram, 1985: 238). 

By linking score levels on the tests to context-specific criteria,
local users can obtain the type of evidence that is needed to help
them form realistic (actuarially based) expectations about the type or
level of on-the-job language proficiency likely to be exhibited by
individuals at different score levels on the test under consideration.
The context-specific approach provides interpretive guidelines that
are locally meaningful. 

However, factors that make context-specific studies valuable for
local test users tend to militate against generalization. For example,
many replications of studies involving particular tests and
criteria would be needed to assess the stability of relationships
across contexts. Moreover, context-specific criteria, like the
indirect test being "pragmatically validated," are likely to involve
only a relativistic classification of the linguistic behavior being
evaluated (e.g., superior, average, below average; satisfactory versus
unsatisfactory). Thus, despite their local pragmatic value, the
results would not contribute directly to improved understanding of the
general levels or types of "ability to use a target language" that
individuals at specified score levels on the norm-referenced test may
be expected to exhibit.

Finally, it is difficult for professionals to design and conduct
rigorous followup studies; local test-users are likely to find it even
more difficult (conceptually and logistically) to do so. Consequently,
as Ingram (1985: 238) has noted, "... few if any adequate studies
exist relating indirect tests to real life or workplace use of the
language."



Generally speaking, in pragmatic validation involving context- 
specific criteria, the questions at issue have to do with whether 
individuals at given score levels on a test are linguistically qual- 
ified for particular ESL-essential jobs, with how adequately they 
perform the ESL aspects of their work, with identifying minimally 
acceptable standards, and so on. 



Relating Indirect Tests to Context-Independent Criteria

In order to obtain more general answers to questions about the 
functional implications of scores on particular indirect proficiency 
tests, it is necessary to conceptualize and conduct studies employing
"context-independent" criteria.

A context-independent criterion may be defined as a measure of
ability to use the target language in circumstances resembling those
likely to be routinely encountered in many different "real-life"
language-use contexts (e.g., situations requiring the exchange of
meaning in conversational interaction). In context-independent
studies, the aim is to assess the relationship between scores on
particular norm-referenced tests and level of performance on one or
more criteria of "functional ability" to use a target language, rated 
according to a scale involving behaviorally defined levels. 

Given scores on the clearly defined, functionally scaled criterion
variable and scores on an indirect, norm-referenced test for a
representative sample from a defined population of second-language
users, application of the familiar regression model would make it
possible to translate (calibrate) scores on the arbitrarily defined
scale of the norm-referenced test into estimated scores on the
behaviorally scaled, hence directly interpretable, criterion through a
regression (calibrating) equation. Questions regarding the applicability
of the resulting regression equation across subpopulations of
interest (e.g., different native-language subgroups) can be addressed
empirically (e.g., through residual analyses designed to assess 
average discrepancies between observed standing on the functional 
criterion and estimated standing based on the general regression 
equation). 
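
As an illustration only, the calibration and residual check just
described can be sketched in Python. The sketch assumes a single
predictor and an ordinary least-squares fit. The function names
(fit_line, mean_residuals_by_group), the subgroup labels, and every
score and rating shown are hypothetical; none is drawn from the study.

```python
from collections import defaultdict


def fit_line(x, y):
    """Ordinary least-squares fit of y on x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_residuals_by_group(records, intercept, slope):
    """Average (observed - estimated) criterion rating within each subgroup."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, score, rating in records:
        sums[group] += rating - (intercept + slope * score)
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}


# Hypothetical (subgroup, test score, criterion rating) records; not study data.
records = [
    ("A", 200, 1.0), ("A", 300, 1.5), ("A", 400, 2.5),
    ("B", 250, 1.5), ("B", 350, 2.0), ("B", 450, 3.0),
]
scores = [record[1] for record in records]
ratings = [record[2] for record in records]

# Calibrating equation fitted to the combined sample.
intercept, slope = fit_line(scores, ratings)

# Estimated criterion standing for a test score of 300.
estimate_at_300 = intercept + slope * 300

# Average discrepancy between observed and estimated standing, by subgroup.
residual_means = mean_residuals_by_group(records, intercept, slope)
```

A subgroup mean residual near zero would suggest that, on average, the
general equation neither overestimates nor underestimates criterion
standing for that subgroup.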



Behavior in Language Proficiency Interviews 
as a Context-Independent Criterion

The LPI model was developed for use in assessing the linguistic
readiness of personnel to undertake assignments with the U.S. government
in posts requiring particular levels of oral language proficiency
in languages other than English. The levels specified by the model
are used to characterize, in functional terms, both the individual's
performance and the demands of particular positions. The LPI procedure
has been adopted without basic modification by nongovernmental
institutions and agencies in both public and private settings for
purposes of evaluation (of second-language training) and certification
(e.g., of second-language teachers).

The behavior that is elicited under controlled conditions and 
systematically rated in the Language Proficiency Interview appears to 
be similar to the type of behavior that is elicited (and evaluated or 
judged informally) in a variety of real-life contexts. The following 
description of the procedures is provided by Clark and Swinton (1979): 

The interview consists of a face-to-face conversation of approx- 
imately 15-25 minutes between the examinee and a trained inter- 
viewer who is a native or near-native speaker of the test lan- 
guage. The conversation begins at a fairly simple level and becomes 
increasingly more sophisticated linguistically, as reflected in 
increased rate of speech on the part of the interviewer, use of more 
complex structures and more specialized vocabulary, up until the 
point at which the examinee is no longer able to hold his or her own 
in conversation at that level. At this point, the level of sophis- 
tication of the conversation is reduced somewhat and the examiner 
spends several minutes exploring the examinee's breadth of command 
of grammatical structures (for example, ability to use past and 
future tense forms, conditional constructions, etc.); and extent of 
active vocabulary, as elicited by questions probing a variety of 
topical areas including personal and family background, work 
activities, studies, hobbies and free-time activities, future plans,
and so forth. With more proficient examinees, the interviewer will 
also broach political, social, economic, or other topics requiring 
very high levels of language use. The interview continues until the 
examiner is satisfied that the interviewee has fully demonstrated 
the highest level of speaking proficiency of which he or she is 
capable (p. 5).

The LPI "Absolute Proficiency Scale"

The most distinctive feature of the LPI model is the scaling 
frame of reference employed. The behavior assessed is classified by 
trained interviewers/raters according to levels ranging from "0" 
(indicating no functional language-use ability in the situation)
through "5" (indicating functioning equivalent to that of an educated
native speaker [ENS]). 

Each of six principal points (0, 1, 2, 3, 4, 5) on this "quasi-absolute
proficiency scale" (to use Carroll's [1967: 2] modification
of the typically employed term "absolute") is "anchored" behaviorally.
Each level is characterized by a clearly defined pattern of language-use
behavior. In traditional score-reporting practice, for levels 0-4
a "+" is added to a level rating for individuals whose performance is
judged to substantially exceed that for a given level, but not to meet
fully the requirements for the next higher level. In analyses
requiring numerical conversions, the plus ratings are designated by
adding .5 to the level whose requirements have been met fully (0+ =
.5, 1+ = 1.5, and so on). Detailed descriptions of the type of
linguistic behavior associated with each of the LPI levels (basic and
intermediate) are provided in Appendix A.
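
The numerical conversion of plus ratings described above can be
sketched in a few lines of Python. This is an illustrative sketch only:
the function name lpi_to_numeric and the sample ratings are
hypothetical, and the sketch assumes ratings are recorded as text such
as "2" or "1+".

```python
def lpi_to_numeric(rating):
    """Convert an LPI rating such as "2" or "1+" to a number.

    A plus rating adds .5 to the level whose requirements have been
    met fully. Plus ratings apply to levels 0-4 only.
    """
    text = rating.strip()
    plus = text.endswith("+")
    level = int(text[:-1] if plus else text)
    if not 0 <= level <= 5 or (plus and level == 5):
        raise ValueError(f"not a valid LPI rating: {rating!r}")
    return level + (0.5 if plus else 0.0)


# Hypothetical ratings, for illustration only.
numeric_ratings = [lpi_to_numeric(r) for r in ["0+", "1", "1+", "2", "3+"]]
```

Ratings converted in this way can be treated as a numerical criterion
variable in correlational and regression analyses.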



Interpretive inferences from rated levels. The types of interpretive
inferences associated with each of the basic levels of the LPI
scale have been succinctly summarized in the form of a "functional
trisection" (shown as Exhibit A), conceptualized by an agency of the
U.S. government (see ETS, 1982: 21). The trisection characterizes
each level as to type of functional ability demonstrated, content 
areas covered, and level of accuracy of language use in the interview 
situation. Level 3 has come to be accepted as indicating "minimum 
professional mastery" of a second language. From a normative per- 
spective, according to one informed observer of the LPI procedure 
(Jones, 1978: 93), circa 1978 (a) there were very few examinees above 
Level 3, and (b) very few language-essential positions within the U.S. 
government that were designated as requiring proficiency higher than 
Level 3. 

LPI Performance as a General Context-Independent Criterion

It has been suggested by second-language assessment experts
(e.g., Clark, 1975) that face-to-face conversation approaches real- 
life communication about as closely as is possible in a test situ- 
ation. 

The relevance and utility of formally elicited and evaluated 
interview behavior as a "surrogate" for direct observation and 
evaluation of ability to exchange meaning conversationally in 
situations that arise naturally in a variety of workplace, academic, 
or other language-use contexts appears to be inferrable directly from 
the procedures described above. Moreover, substituting the controlled 
interview for observation and assessment of the ". . . operational 
(language-use) capability of a man on the job" was proposed by
Francis Cartier (1975), a discussant of Wilds' (1975) frequently cited 
seminal paper describing the development and use of the "oral inter- 
view test." Noting problems involved in obtaining real-life criteria, 
Cartier commented as follows: 

Let me point out that without at least metric access to the 
criterion situation, we have [in the structured interview] what we 
must call a surrogate criterion. We would like to, for example, 
correlate paper and pencil tests with interviews, and the reason we 
would like to do that is that the interviewer gives us this kind of 
surrogate criterion which we have to use simply because we can't 
apply any sort of metric to the criterion population and situation 
(Cartier, 1975: 12). 

It is perhaps obvious that the interview situation does not
provide a basis for simulating language-use requirements in all
possible "real life" contexts. As noted by Clark and Swinton (1979:
6):



Exhibit A

FUNCTIONAL TRISECTION OF ORAL PROFICIENCY LEVELS*

Each oral proficiency level is described under three headings:
Function (tasks accomplished, attitudes expressed, tone conveyed);
Context (topics, subject areas, activities and jobs addressed); and
Accuracy (acceptability, quality and accuracy of message conveyed).

Level 5
  Function: Function equivalent to an educated native speaker (ENS).
  Context:  All subjects.
  Accuracy: Performance equivalent to an ENS.

Level 4
  Function: Able to tailor language to fit audience, counsel,
            persuade, negotiate, represent a point of view and
            interpret for dignitaries.
  Context:  All topics normally pertinent to professional needs.
  Accuracy: Nearly equivalent to an ENS. Speech is extensive,
            precise, appropriate to every occasion with only
            occasional errors.

Level 3
  Function: Can converse in formal and informal situations, resolve
            problem situations, deal with unfamiliar topics, provide
            explanations, describe in detail, offer supported
            opinions, and hypothesize.
  Context:  Practical, social, professional and abstract topics,
            particular interests, and special fields of competence.
  Accuracy: Errors never interfere with understanding and rarely
            disturb the native speaker. Only sporadic errors in
            basic structures.

Level 2
  Function: Able to fully participate in casual conversations, can
            express facts, give instructions, describe, report, and
            provide narration about current, past and future
            activities.
  Context:  Concrete topics such as own background, family,
            interests, work, travel, and current events.
  Accuracy: Understandable to native speaker not used to dealing
            with foreigners. Sometimes miscommunicates.

Level 1
  Function: Can create with the language, ask and answer questions,
            participate in short conversations.
  Context:  Everyday survival topics and courtesy requirements.
  Accuracy: Intelligible to native speaker used to dealing with
            foreigners.

Level 0
  Function: No functional ability.
  Context:  None.
  Accuracy: Unintelligible.

*From ETS (1982b)
An obvious shortcoming is that the interview setting cannot directly
reproduce the [great variety of] physical surroundings in which the 
examinee would be expected to perform in real life. . . (T)he 
psychological and affective aspects of real-life communication, 
including motivation and communicative intent of the speakers, 
status roles of interviewer and interviewee, and a number of other 
aspects of the real-life situation cannot be precisely duplicated in 
the interview setting (Clark & Swinton, 1979: 6). 

Limitations of this kind may be thought of simply as inherent
constraints in generalizing "level of second-language conversational
ability" from the structured interview results--results that should
not be expected to be equally predictive of functioning (language-use
criteria) in every real-life situation involving the exchange of
meaning through conversation. After all, on the basis of Cartier's
succinct conceptualization of the issue, we may say that the
controlled interview technique is useful as a surrogate criterion for
referencing scores on indirect tests precisely because it is not
possible to measure language-use behavior generally conceived.

Acceptance of LPI performance as a surrogate for "real-life"
performance criteria for validating "paper and pencil" (multiple-
choice, norm-referenced, indirect) tests does not obviate the need for
pragmatic validation of the surrogate criterion itself, using some
measure of "operational capability of a man on the job" (especially of
the language-use dimensions of such capability) as a criterion.

Reliability Considerations 

The reliability of the LPI criterion was not directly at issue in
this criterion-referencing study.

"High reliability in a criterion measure is convenient but not
critically important. Low reliability in a criterion measure merely
attenuates all its relationships with other measures" (Thorndike,
1949: 127).

However, there is a significant body of empirical evidence
bearing on the reliability of the LPI procedure as administered within
the U.S. government (e.g., Adams, 1978; Clark, 1978a, 1978b passim)
and elsewhere (e.g., Clark, 1978c; Clark and Swinton, 1979; Hilton et
al., 1985).

For purposes of the present study, it is useful to call attention
to certain general conditions that have been found to affect the
reproducibility of LPI ratings, that is, consistency with regard to
both rank order and level in rating LPI performance. The number of
raters, of course, is a generic reliability-related factor:
reliability tends to increase as the number of raters increases.
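
One conventional way to quantify the effect of adding raters is the Spearman-Brown formula. The source does not name this formula; it is introduced here only as a hedged illustration, and it assumes that raters behave as interchangeable (parallel) measures. The single-rater value of .70 used below is hypothetical.

```python
def spearman_brown(single_rater_reliability: float, n_raters: int) -> float:
    """Projected reliability of the mean of n_raters parallel ratings.

    Assumption: raters are interchangeable, so the classical
    Spearman-Brown prophecy formula applies.
    """
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


if __name__ == "__main__":
    # Hypothetical single-rater reliability, for illustration only.
    for k in (1, 2, 3, 4):
        print(k, round(spearman_brown(0.70, k), 3))
```

Under these assumptions the projected reliability rises with each added rater, but by smaller increments each time.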

Apart from this generic consideration, the reproducibility of 
ratings, by raters trained in the LPI procedure, is enhanced when all 
raters "share the same roof." As noted by Adams (1978: 35), the
system works best with all interviewers "... under one roof, able to 
consult with each other . . . and most apt to break down . . . when 
examiners are isolated." Limited empirical evidence of the effects of 
"same site" versus "scattered site" conditions on the reproducibility 
of LPI ratings tends to support Adams' observation. 

Clark (1987) compared ratings (in French and German, respective-
ly) for the same interviewees by interviewer/raters in three different
U.S. government agencies (Clark did not examine intra-agency
reliability).

"(Although) the ratings assigned did not differ across agencies in a
statistically significant way . . . examination of the rating per-
formance for various sub-portions of the proficiency scale showed
fairly clear across-agency differences . . . primarily at the lower
and middle ranges of the scale" (p. 145).

Bejar (1985) found that reliability of ratings of samples of
ESL-speaking behavior, represented by taped recordings of items from
the Test of Spoken English (ETS, 1985c), improved when "same site"
conditions were introduced.



The Carroll Model 



Like any assessment procedure that involves direct observation
of individual behavior, and clinical or subjective evaluation of
observed behavior samples, the LPI model is too costly and cumbersome
to administer to be considered as the primary instrument in large-
scale programs for which the multiple-choice, indirect, norm-
referenced test is admirably suited. However, as indicated earlier,
Carroll (1967) recognized that the interpretive power of behaviorally
anchored, direct assessment procedures such as the LPI (and parallel
procedures for assessing reading or writing skills) could be extended
to populations of interest by empirical linkage to related, norm-
referenced tests, using linkage rules developed in samples from the
populations.

No other large-scale studies using the basic Carroll (1967) model 
to calibrate norm-referenced test scores to the LPI scale appear to 
have been conducted in the United States. However, the Carroll model 
was employed to calibrate self-ratings of oral language proficiency to 
the LPI scale in a national study of the oral language proficiency of 
teachers of French and Spanish in the U.S. (Hilton, Grandy, Kline, & 
Liskin-Gasparro, 1985). For purposes of the present study, therefore,
it is quite important to examine Carroll's conceptual and methodolog- 
ical approach, as well as illustrative findings from both of these 
national studies, in some detail. 
Carroll's (1967) "Calibration" Study



Carroll was interested in assessing the foreign language skills 
of language majors near the end of their senior year in college and, 
incidentally, in developing national norms for a series of proficiency 
tests known as the Modern Language Association Proficiency Tests 
(MLAPT) in French, German, Italian, Russian, and Spanish. Each norm- 
referenced test included measures of listening comprehension, speaking 
(scored by trained judges), reading, and writing (a free response 
"cloze" type of test, scored by trained judges). 

To address the problem of inferring language-use ability from
scores on the norm-referenced MLAPT, a "calibration substudy" was
conducted. The purpose of this substudy was

". . . to ascertain correspondences between MLA Proficiency Test
scores and the 'absolute proficiency ratings' rendered by expert
teams from the Foreign Service Institute of the U.S. Department of
State--(that is) to calibrate the scores on the MLA Proficiency
Tests in terms of 'quasi-absolute,' inherently meaningful standards"
(Carroll, 1967: 2).

To establish the correspondence between the test scores and
ratings (LPI or Speaking [S] and Reading [R]), Carroll initially hoped
to obtain data for samples of about 50 cases in each language.
Ratings were finally obtained for somewhat smaller samples composed of
participants in summer language institutes (attended by teachers and
advanced students). The basic data generated in the equivalency
substudy are summarized in Exhibit B: Carroll's (1967) Table 2.2,
showing n's, means, standard deviations, and intercorrelations of
MLAPT scores and ratings for the respective calibration samples
(reprinted by permission of the Harvard Graduate School of Education).
A brief description of the levels for Speaking (that is, LPI levels)
and the corresponding levels for Reading (R-scale) is included in the
table.

Carroll made a decision to link MLAPT speaking and listening
scales to the Speaking scale (S-scale), and the MLA reading and
writing scales to the Reading scale (R-scale). The variables were
relatively highly intercorrelated.

Regarding this decision, Carroll made the following observation: 

The correlations between the two FSI ratings, S and R, are quite 
high . . . Save possibly in the case of French, there is little 
evidence in the FSI-MLA correlations to suggest that FSI Speaking
[LPI] ratings are more highly correlated with MLA Listening and 
Speaking scores than with MLA Reading and Writing scores, nor that 
FSI Reading ratings are more correlated with MLA Reading and Writing 
scores than they are with Listening and Speaking scores. Neverthe- 
less, on an a priori basis [the linkage pattern designated above was 
followed] (Carroll, 1967: 13). 
Exhibit B

Data for the Calibration Samples: Table 2.2 from Carroll (1967: 14)

                  FRENCH         GERMAN         RUSSIAN        SPANISH
No. tested          39             39             19             30*

Test              Mean   S.D.    Mean   S.D.    Mean   S.D.    Mean   S.D.
MLA List.        47.38   6.07   45.62   7.93   41.84   5.48   44.57   5.71
MLA Speak.       82.97   9.84   97.90  19.83   85.00  10.92   84.87   9.49
MLA Read.        50.41   7.82   51.59  11.03   33.16   8.96   43.77   5.94
MLA Write        49.05   8.10   55.51  14.26   56.32   9.31   52.83  10.72
FSI Speak.*       2.62    .64    3.13   1.08    1.97    .66    2.58    .75
FSI Read.*        3.15    .66    3.10   1.10    1.89    .57    2.86    .66

Correlations with FSI Ratings

                  "S"    "R"     "S"    "R"     "S"    "R"     "S"    "R"
MLA List.         .67   (.61)    .73   (.72)    .84   (.75)    .73   (.80)
MLA Speak.        .67   (.49)    .82   (.83)    .78   (.66)    .66   (.65)
MLA Read.        (.58)   .71    (.82)   .82    (.78)   .69    (.63)   .74
MLA Write        (.65)   .63    (.86)   .84    (.62)   .71    (.70)   .77
FSI "R"          (.69)          (.95)          (.90)          (.80)

* In computing these values, a "+" is given a value of .5. Thus, 1+ is
coded 1.5, 2+ = 2.5, etc. For the meanings of the FSI ratings, see
below.

Native or bilingual proficiency
  S-5  Speaking proficiency equivalent to that of an educated native
       speaker.
  R-5  Reading proficiency equivalent to that of an educated native
       speaker.

Full professional proficiency
  S-4  Able to use the language fluently and accurately on all levels
       normally pertinent to professional needs.
  R-4  Able to read all styles and forms of the language pertinent to
       professional needs.

Minimum professional proficiency
  S-3  Able to speak the language with sufficient structural accuracy
       and vocabulary to satisfy representation requirements and
       handle professional discussions within a special field.
  R-3  Able to read non-technical news items or technical writing in
       a special field.

Limited working proficiency
  S-2  Able to satisfy routine social demands and limited office
       requirements.
  R-2  Able to read intermediate lesson material or simple colloquial
       texts.

Elementary proficiency
  S-1  Able to satisfy routine travel needs and minimum courtesy
       requirements.
  R-1  Able to read elementary lesson material or common public
       signs.

"All the ratings except the S-5 and R-5 may be modified by a
plus (+), indicating that proficiency substantially exceeds the
minimum requirements for the level involved but falls short of
those for the next higher level."

-- Extracted from "Absolute Language Proficiency Ratings,"
   Circular, May 1963, Foreign Service Institute,
   Washington, D.C.
On the basis of the data summarized in Exhibit B, MLAPT scores
were translated to either the LPI/S-scale or the corresponding scale
for reading (the R-scale, for which general descriptors are provided
in Exhibit B), in the pattern indicated above, by the method of "equal
standard scores": $z_X = (X - X^{*}) / SD_X = z_Y = (Y - Y^{*}) / SD_Y$,
where $X$ and $Y$ are observed scores, $X^{*}$ and $Y^{*}$ are means,
and $SD_X$ and $SD_Y$ are standard deviations of the respective
variables.
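
The "equal standard scores" rule can be sketched in a few lines. This is an illustration only: the function name is hypothetical, and the numbers are the French calibration-sample values from Exhibit B (MLA Listening mean 47.38, S.D. 6.07; FSI Speaking mean 2.62, S.D. .64), used here to show the arithmetic rather than to reproduce Carroll's published conversion tables.

```python
def equate_equal_z(x: float, mean_x: float, sd_x: float,
                   mean_y: float, sd_y: float) -> float:
    """Map score x to the Y scale so that both have the same z-score."""
    z = (x - mean_x) / sd_x          # standard score on the X scale
    return mean_y + z * sd_y         # same standard score on the Y scale


# French calibration sample, Exhibit B (illustrative use only).
FRENCH_LISTENING_TO_S = dict(mean_x=47.38, sd_x=6.07, mean_y=2.62, sd_y=0.64)

if __name__ == "__main__":
    for score in (41.31, 47.38, 53.45):
        level = equate_equal_z(score, **FRENCH_LISTENING_TO_S)
        print(score, round(level, 2))
```

A score at the sample mean maps to the mean rating, and a score one standard deviation above the mean maps to a rating one standard deviation above the mean rating, regardless of how strongly the two measures are correlated.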

Carroll commented on his use of an equating model for calibrating
the MLAPT scores to the functional scales of the criterion variables,
as follows:

(The equating approach) . . . merely assumes that X and Y are equal-
ly estimates of the same thing, and that it is an arbitrary matter
whether one measures this thing by X or by Y . . . The more X and Y
are correlated, the more this procedure is justified. It is felt
that in the present case, the corresponding measurements are suffic-
iently well correlated to justify the procedure, particularly in
view of the fact that the purpose of the study was merely to estab-
lish meaningful standards for the interpretation of MLAPT scores
(Carroll, 1967: 15, emphasis added).

The observed means and standard deviations of the ratings in the 
calibration samples permitted (probably for the first time) empiric- 
ally based inferences about the level of development of significant 
aspects of functional ability to use particular target languages in 
particular populations of second language learners (relative to 
expectation for educated native speakers). By calibrating MLAPT
scores to the inherently interpretable scales of the direct assess- 
ments, Carroll and his associates were able to extend these inferences 
to the populations of interest. 

For purposes of the present study it is useful to examine selected 
patterns of findings that point up the interpretive contribution of 
referencing scores on the norm-referenced MLAPT series to the func-
tionally defined LPI (Speaking) scale and the conceptually comparable
Reading scale. 

Interpretive contribution--some illustrative examples. In eval-
uating results for college-senior-level majors in the respective lan-
guages, Carroll (1967: 46) commented as follows:

Taking the results at their face value . . . we find that a general 
characteristic of the tested samples is that they are much poorer in 
Listening and Speaking skills than they are in Reading and Writing. 

This pattern is evident in Figure 1a (showing the pattern of
functionally scaled equivalents of the MLAPT medians for the four
groups of majors), and Figure 1b (showing relative frequency dis-
tributions for estimated functional Speaking (LPI) and Reading levels
for French majors only). The level indicating "minimum professional
proficiency" (Level 3) is highlighted.
The findings reflected in these figures permit inferences about
the differential development of functional second-language listening 
and reading skills in the populations of interest. It is important to 
recognize that inferences about differential levels of functioning 
with respect to specified language skills in a given population of 
nonnative speakers cannot be derived from consideration of average 
standard scores on norm-referenced tests of listening and reading
skills in representative samples of users/learners from the intended 
population of test takers. In order to make such inferences, it is 
necessary to contrast the performance of samples from the focal (test- 
standardization) population on the respective measures with that of 
native speakers of the target language. 

This point is reinforced in Figures 2a, 2b, & 2c. The figures 
are based on score distributions reported by Angoff and Sharon (1971),
who studied the comparative performance on the Test of English as a 
Foreign Language (TOEFL) of a sample of U.S. college freshmen in a 
relatively unselective college with that of a general TOEFL refer- 
ence group. TOEFL examinees typically have relatively high levels of 
academic/cognitive skills developed primarily through the medium of 
their respective native languages. 

Figure 2a shows distributions of standard scores for TOEFL 
Listening Comprehension and TOEFL Reading Comprehension for a general 
sample of TOEFL examinees (who took the five-part version of the 
test). The two distributions are identical for all practical purposes. 
This is a psychometrically assured phenomenon. Because both sec-
tions were standardized in a sample from the population of interest,
the listening comprehension items and the reading items were 
specifically selected so as to be of "average difficulty" for the 
standardization sample. This process effectively obscures any devel- 
opmental differences that may be present in a target population. 

When the performance of a group of native-English speakers on the
respective sections is introduced as an interpretive frame of
reference, differential levels of skill development are clearly
inferable (cf. Figures 2b and 2c).

In essence, the general nature of the inferences from the series
of figures is similar to the general nature of inferences from Figure
1b, reflecting Carroll's directly interpretable findings. It is not
possible to draw comparable functional inferences from the distribu-
tions of TOEFL Listening Comprehension and Reading scores for a gener-
al TOEFL reference group, as indicated in Figure 2a. This is a
generic problem with norm-referenced tests of second-language
proficiency.

Calibrating Self-Assessments to the LPI Scale 

Hilton, Grandy, Kline, & Liskin-Gasparro (1985), with the collab-
oration of Steven A. Stupak and Protase E. Woodford (ETS staff mem-
bers), conducted a study of the oral language proficiency of foreign-
language teachers in the U.S., in which an equating approach similar
to that employed by Carroll (1967) was used to reference self-ratings
of oral language proficiency to the LPI scale.

[Figure 2c. Comparative performance of native- and nonnative-English
speakers on TOEFL Listening Comprehension. Horizontal axis: TOEFL
standard score interval. Note: Adapted from Angoff and Sharon.]

The calibration substudy. LPI ratings were obtained for 108
teachers of French and 113 teachers of Spanish, using recorded inter-
views. A total of 27 field raters and eight ETS language-staff
members were involved in ratings; at least two ratings were obtained
for each interview.



Average inter-rater reliabilities for the two "regular" raters
were .71 (French) and .73 (Spanish)--a "master" rater was used when
regular raters disagreed in level by more than one-half point.

Figure 3 shows relative frequency distributions of LPI ratings 
for the two validation samples. The distributions appear to be 
somewhat positively skewed and centered below the "minimum profes- 
sional proficiency" level (LPI Level 3) . This appears to represent a 
relatively low level of functional ability in using target languages 
in populations that may reasonably be assumed to be highly selected in
terms of developed functional proficiency. 

The results of an analysis of potential correlates of these 
criterion ratings suggest that many (if not the great majority) of the 
higher rated members of the validation ("calibration") samples were 
probably native speakers of the target languages, not native-English
speakers academically trained to pursue a career in teaching French or
Spanish.



Of the more than 50 background variables included in the study,
those listed below had the highest correlations with the oral language
proficiency criterion (results are for the combined sample of French
and Spanish teachers):

Correlation
with LPI rating   Background variable

     .57          Parents' native language (target language)
     .53          Birthplace (country of target language)
     .52          Native language (target language)
     .41          Time spent in country where target language
                  is the dominant language
     .33          Speak target language at home
     .30          Speak target language with spouse
     .29          Spouse's native language (target language)



These correlations suggest that the highly rated teachers prob- 
ably were native speakers of the languages under consideration. In 
sharp contrast, training-related variables were not very predictive of
proficiency ratings: for example, college French grades (-.05), years 
of language beyond high school (.09), years teaching foreign language 
(.07), and opportunity to continue study of present language (.07). 



[Figure 3. LPI ratings for teachers of French and Spanish (adapted
from Hilton et al., 1985). Vertical axis: percent at level.
Horizontal axis: functional level, 0 (no proficiency) through 5
(equivalent to educated native speaker), in half-level steps.
Note: French (N = 108); Spanish (N = 113).]

Note. Brief description of functional levels.

Level 5  Function equivalent to an educated native speaker.

Level 4  Able to tailor language to fit audience . . . on all topics
         pertinent to professional needs.

Level 3  Can converse in formal and informal situations . . . deal
         with unfamiliar topics . . . offer supported opinions.

Level 2  Able to participate fully in casual conversations, can speak
         in extended discourse, express facts, give instructions,
         describe, report, and provide narration about current, past
         and future activities.

Level 1  Can create with the language, ask and answer questions,
         participate in short conversations.

Level 0  No proficiency.
Global self-ratings were calibrated to the LPI scale by the
method of equipercentile equating (Hilton et al., 1985: 26-27), with
the implicit assumption, expressed explicitly by Carroll, of
functional equivalence for study purposes. Conclusions regarding the
distributions of oral language proficiency in the general samples of
teachers were similar to those for the calibration sample.
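
The general idea of equipercentile equating can be sketched as follows. This is a deliberately crude, unsmoothed version with no interpolation, intended only to show the logic; it is not the procedure reported by Hilton et al., and the function names and sample values are hypothetical.

```python
from bisect import bisect_right


def percentile_rank(sample, value):
    """Proportion of the sample at or below the given value."""
    ordered = sorted(sample)
    return bisect_right(ordered, value) / len(ordered)


def equipercentile(value, from_sample, to_sample):
    """Map a score to the to_sample value with the same percentile rank.

    Returns the smallest to_sample value whose cumulative proportion
    reaches the percentile rank of `value` in from_sample.
    """
    rank = percentile_rank(from_sample, value)
    ordered = sorted(to_sample)
    for i, y in enumerate(ordered, start=1):
        if i / len(ordered) >= rank:
            return y
    return ordered[-1]


# Hypothetical data: global self-ratings and LPI ratings for one sample.
SELF_RATINGS = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
LPI_RATINGS = [1.0, 1.5, 1.5, 2.0, 2.0, 2.5, 2.5, 3.0, 3.5, 4.0]

if __name__ == "__main__":
    for s in (1, 3, 5):
        print(s, equipercentile(s, SELF_RATINGS, LPI_RATINGS))
```

As with the equal-standard-scores rule, the mapping depends only on the two marginal distributions, not on how closely self-ratings and LPI ratings agree for individual examinees.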

Contribution of Studies Using the Carroll Model 

The findings of Carroll's benchmark study, reinforced by those of
Hilton et al. using the Carroll model, are elementally important be-
cause they point up the intrinsic interpretive value of the models
developed by the Foreign Service Institute for the direct assessment
of oral English proficiency (the LPI) and reading proficiency, and
their corresponding "quasi-absolute" proficiency scales, when applied
to samples from defined populations of second-language users/learners.
The findings also indicate clearly that the interpretive value of
norm-referenced tests and other measures of proficiency can be
enhanced by referencing the scales of the indirect proficiency
measures involved to the quasi-absolute LPI scale (or to conceptually
comparable scales for the direct assessment of reading proficiency).

Knowledge of the distribution of LPI ratings for samples from a
population of users/learners provides meaningful information regard-
ing the probable level and dispersion of a clearly defined functional
ability to use the language under consideration in that population.
Such inferences cannot be drawn from knowledge of sample distributions
of scores on norm-referenced proficiency measures alone.

Carroll's (1967) pioneering study demonstrated clearly that
knowledge of the joint distributions of ratings based on direct
assessment procedures and norm-referenced test scores in samples
generally representative of a focal population can be used to
establish "linkage rules" that provide a basis for inferences about
level of functional ability to use a target language, from examinees'
scores on norm-referenced tests--the principal interpretive issue.
Hilton et al. (1985) demonstrated the generalizability of the Carroll
model by calibrating self-assessments of oral language proficiency to
the LPI scale.

In both of these large-scale empirical studies, the linkage rules 
employed involved a working assumption that the indirect and the 
direct measures were functionally interchangeable for the purpose of 
providing interpretive guidelines for indirect measures. To reiterate 
Carroll's (1967) characterization, use of an equating model " . . . 
merely assumes that X and Y are equally estimates of the same thing 
and that it is an arbitrary matter whether one measures this thing by 
X or by Y. . . The more X and Y are correlated, the more this proced- 
ure is justified." 
As shown earlier (in Exhibit B), for Carroll's four relatively
small calibration samples the correlations of the MLAPT Listening and
Speaking scores with LPI (Speaking) ratings (across language groups--
French, German, Spanish, Russian) ranged between .66 and .84. Comp-
arable ranges for the MLAPT Reading and Writing scores versus Reading
ratings were, respectively, .69 to .82, and .63 to .86. Correlations
between Speaking and Reading ratings were .69 (French), .95 (German),
.90 (Russian), and .80 (Spanish). And, of course, Carroll concluded
that for the purpose of setting interpretive guidelines for the MLAPT,
levels of intercorrelation such as the foregoing were quite
satisfactory.

Advantages of the Regression Model 

Despite the demonstrably improved interpretive perspective pro- 
vided by equating models, such as those employed by Carroll (1967) and 
Hilton et al. (1985), it is preferable to employ an approach to link-
ing performance on indirect, norm-referenced tests to levels of per-
formance on functionally scaled criteria that does not require the
assumption of equivalency for working purposes. 

Given joint distributions of LPI ratings and scores on indirect,
norm-referenced measures for a given sample, it is clear that a
regression-based calibration model does not require a priori
assumptions about the organization of second-language skills, or the
psychometric or theoretical equivalence of the measures involved.

At the same time, a regression-based approach to this problem 
obviously need not be atheoretical. By regressing LPI ratings on
measures of listening and reading skills, for example, it is possible 
to assess the hypothesis of greater correspondence between second- 
language speaking and listening skills than between speaking and 
reading skills, while at the same time establishing and evaluating 
statistically meaningful criterion-estimation rules. 

In this connection, it is noteworthy that in the regression 
model, but not in the equating model, the scales of the indirect 
measures involved are referenced (calibrated) to the functionally 
scaled criterion variable according to linkage rules that vary 
directly with the observed level of association between the indirect 
measures and the functional criterion in calibration samples. Thus, 
regression-based estimates of criterion behavior are more explicitly 
delimited than are inferences that derive from the application of 
simple equating models. And the usefulness of the regression model
for purposes of criterion-referencing is well established.
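
The contrast between the two linkage rules can be made concrete with a small sketch. In simple linear regression the slope is the correlation times the ratio of standard deviations, so the estimate shrinks toward the criterion mean as the correlation falls, and the standard error of estimate states how far individual criterion values are expected to scatter around it. The function names are hypothetical, and the French calibration values from Exhibit B (r = .67 between MLA Listening and FSI Speaking) are used only to illustrate the arithmetic; they are not results of the present study.

```python
import math


def equating_estimate(x, mean_x, sd_x, mean_y, sd_y):
    """Equal-standard-scores estimate: slope ignores the correlation."""
    return mean_y + (sd_y / sd_x) * (x - mean_x)


def regression_estimate(x, mean_x, sd_x, mean_y, sd_y, r):
    """Least-squares estimate of Y from X: slope is r * SD_Y / SD_X."""
    return mean_y + r * (sd_y / sd_x) * (x - mean_x)


def standard_error_of_estimate(sd_y, r):
    """Spread of observed Y values around the regression estimate."""
    return sd_y * math.sqrt(1.0 - r * r)


# French calibration sample, Exhibit B (illustrative use only).
STATS = dict(mean_x=47.38, sd_x=6.07, mean_y=2.62, sd_y=0.64)
R_LISTENING_SPEAKING = 0.67

if __name__ == "__main__":
    x = 53.45  # one standard deviation above the listening mean
    print(round(equating_estimate(x, **STATS), 2))
    print(round(regression_estimate(x, r=R_LISTENING_SPEAKING, **STATS), 2))
    print(round(standard_error_of_estimate(STATS["sd_y"],
                                           R_LISTENING_SPEAKING), 2))
```

For a score one standard deviation above the mean, the equating rule places the examinee a full standard deviation above the mean rating, whereas the regression rule places the estimate only r standard deviations above it and attaches an explicit error band.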

As a general proposition, regressing a functionally scaled cri-
terion variable of the type represented by LPI performance on indi-
rect, norm-referenced test scores, in samples of test takers from
defined populations of second-language users/learners, can be
expected, a priori, to provide (a) evidence that permits an informed
evaluation of the patterns of relationships among the measures under
consideration from both theoretical and practical perspectives; (b)
statistically delimited inferences (e.g., estimates, with standard
errors), from scores on the indirect test, about probable level of
defined language-use behavior for individuals in samples from the
test-taking population involved; and (c) inferences regarding the
probable level and dispersion of oral language proficiency in the
test-taking population, according to the directly interpretable LPI
scale.
Section III: REFERENCING TOEIC SCORES TO LPI RATINGS

IN THE TOEIC TESTING CONTEXT

The empirical study described in this section was undertaken to 
provide an assessment of efforts by the TOEIC testing program to es- 
tablish functional guidelines for interpreting scores on the Test of 
English for International Communication (TOEIC), a norm-referenced
test of English-language listening comprehension (TOEIC-LC) and read- 
ing skills (TOEIC-R), by relating TOEIC scores to levels of "func- 
tional ability to use English in face-to-face conversation," defined 
operationally by behavior elicited using the Language Proficiency 
Interview (LPI) procedure, and rated according to the behaviorally 
anchored LPI (oral language proficiency) scale. 

The study involved an apparently novel use (in the context of an 
operational ESL proficiency testing program) of the familiar regres- 
sion model for the purpose of referencing scores on a norm-referenced 
ESL proficiency test to LPI performance, treated as a general, con- 
text-independent criterion measure. 



The TOEIC Testing Context 

The data employed in the study were not collected for ad hoc
research purposes, but were generated in operational test-use settings
in Japan (where the TOEIC was introduced in 1979), France, Mexico, and
Saudi Arabia (countries in which the TOEIC has been introduced more
recently). Detailed information regarding the TOEIC testing program
and its operations is available elsewhere (e.g., ETS, 1985b, 1986a,
1986b, 1988: 8). For perspective, a general overview is provided
below.

TOEIC Testing Programs 

The TOEIC is used primarily by corporate clients outside the 
United States. The majority of TOEIC examinees are tested in places of 
work, or work-related ESL training, at the behest of employers 
(instructors), in group administrations of previously administered 
editions of the TOEIC, as part of the TOEIC Institutional Program 
(IP). In Japan and Korea, the TOEIC is also offered in three (3) 
national TOEIC Secure Program (SP) administrations annually, involving 
new forms of the TOEIC, for which individual preregistration is 
required. 

In Japan, Korea, and several other countries, TOEIC-related
assessment services are provided under the aegis of national TOEIC
representative offices. In countries without national TOEIC
representative offices, the TOEIC and TOEIC-related assessment
services are obtained, by ad hoc arrangement with the TOEIC/ETS
(Princeton) office, through the TOEIC International Corporate Program
(ICP). ETS is responsible for test development, scoring for SP
administrations, and general oversight of TOEIC affairs worldwide.

TOEIC programs are most highly developed in Japan, where the TOEIC
was introduced in 1979 at the request of the Japanese Ministry of
International Trade and Industry (MITI). Currently, the majority of
TOEIC corporate/institutional clients are located in Japan, as are
about 80 percent of all TOEIC examinees; hundreds of thousands of
examinees have been tested in Institutional Program (IP) and Secure
Program (SP) test administrations under auspices of the TOEIC Steering
Committee in Japan. The TOEIC and TOEIC-related assessment services
are also being used regularly, though to a lesser extent, in a number
of other countries.



Characteristics of the Examinee Population 

TOEIC examinees, worldwide, are likely to be 

1. adult ESL users/learners who are relatively highly educated,
typically at or beyond secondary-educational levels in national
educational systems that provide formal instruction in English as a
foreign language (EFL), and whose language-learning background is
characterized by a core of academic EFL instruction, often with
additional intensive ESL instruction;

2. employed or preparing for employment in ESL-essential positions,
at home or abroad, with a company engaged in international
business, commerce, or industry; hence directly or indirectly
screened on pertinent employment-related cognitive, educational,
personal, or other criteria, including English proficiency;

3. tested in their places of work or work-related intensive ESL
training, in administrations under the supervision of local TOEIC
representatives or company-designated personnel.



TOEIC examinees in Japan, for example, are largely
university-educated. They share a basic core of exposure to
curriculum-embedded English-language instruction.


Focus of the Present Study 

The present study focuses primarily on data pertaining to the
TOEIC testing context in Japan, where TOEIC/LPI data-sets have been
generated for several years. All data-sets employed were derived from
comprehensive operational ESL assessments involving the concurrent use
of the TOEIC and the LPI procedure, conducted in representative
TOEIC-use settings by resident ESL professionals trained (and
periodically "recalibrated") as LPI interviewers/raters in workshops
conducted by TOEIC-ETS staff. Individual TOEIC-score data were
available for general samples of Japanese TOEIC-SP examinees.

Similar, but less extensive, TOEIC/LPI data-sets were also
available for samples of examinees from representative TOEIC-use
settings in three other countries: France (F), Mexico (M), and Saudi
Arabia (S). These data-sets were generated in comprehensive ad hoc
ESL assessments, also involving the joint use of the TOEIC and the LPI
procedure, conducted by TOEIC/ETS staff members for corporate clients
in those countries in 1987 and 1988. Due to the incipient nature of
the testing programs in the countries involved, individual TOEIC-score
data for general samples of French, Mexican, and Saudi examinees were
not available for analysis.



Analytical Approach and Study Procedures 

The TOEIC/LPI data for the samples described generally above were 
analyzed, using LPI performance as the criterion variable in the 
familiar regression model. On the basis of evidence and lines of 
reasoning developed in detail in the preceding section, it was assumed 
from the outset that the regression results would provide evidence 
needed to permit TOEIC users to make statistically delimited inter- 
pretive inferences from TOEIC scores, about probable level of oral 
English proficiency in samples of ESL users/learners from TOEIC
testing contexts such as those represented in the study.

It was expected that this regression-based approach would gen- 
erate useful interpretive guidelines for the TOEIC because ratings 
(scores) on the LPI criterion have direct "representational value," 
and because regression results, by definition, can be expected to 
indicate the extent to which the TOEIC scores share that
"representational value." In other words, the regression results would
contribute to the development of a defined "expectancy-set" about
examinees' functional ability to use English based on their test
scores.

An assessment was made of the level and pattern of concurrent 
correlation between the LPI criterion and TOEIC scores (LC, R, and 
Total [LC+R]) in the comparatively large Japanese TOEIC/LPI
calibration sample, in the several non-Japanese samples, individually,
in the total non-Japanese sample, and in the combined sample of
Japanese and non-Japanese examinees.

It was hypothesized that the level of developed oral English 
proficiency (defined operationally by LPI performance) would be linked 
more closely to the level of developed English-language listening 
comprehension (indexed by TOEIC-LC) than to the level of developed
English-language reading ability (TOEIC-R) in samples of educated ESL 
users/learners in representative TOEIC-use settings. 

On logical/theoretical grounds, the ability to comprehend spoken
English can be expected to affect performance in an interview
situation, in which listening comprehension is measured semi-directly.
This is not true in the case of reading ability. TOEIC-R clearly is
an indirect measure of oral English proficiency.

A question that is of both theoretical and pragmatic interest has
to do with whether a composite of LC and R scores tends to be more
valid for predicting criterion (LPI) performance than either of the
two scores alone.

In analyzing the data, particular attention was given to
evaluating the relative usefulness of three different TOEIC/LPI
linkage equations, namely, (a) one equation based solely on the
Listening Comprehension score, (b) a second equation based on
TOEIC-Total (the simple sum of LC and R scores, informally weighted
according to their standard deviations), and (c) a third equation
specifying a "best-weighted" composite of LC and R, based on
regression results in a particular sample. Equations were generated
using TOEIC/LPI data for (a) the Japanese sample, (b) the total
non-Japanese sample (that is, the French, Mexican, and Saudi
examinees), and (c) the combined Japanese and non-Japanese samples.

Attention was focused first on an assessment of TOEIC/LPI rela- 
tionships in samples of Japanese examinees. LPI ratings were regres- 
sed on TOEIC scores to (a) evaluate the degree and nature of associa- 
tion between functionally scaled LPI ratings (the criterion variable) 
and TOEIC-LC, TOEIC-R, and TOEIC-Total, and (b) develop equations for 
estimating LPI ratings from TOEIC scores in the Japanese testing 
context. Equations developed in the calibration sample were used to 
estimate the distribution of criterion performance in general samples 
of Japanese TOEIC examinees. 
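
As a rough illustration of how a calibration-sample equation might be
applied to a general sample, the sketch below converts a sample's
TOEIC-LC mean and standard deviation into an estimated LPI mean and
dispersion. It is not the study's own computation: the function name
is hypothetical, the inputs are the SP-87 LC statistics from Table 2
and the LC equation and residual SD from Table 4, and the final step
assumes errors of estimate that are uncorrelated with the estimates
and constant across the score range.

```python
import math

def estimated_lpi_distribution(mean_score, sd_score, slope, intercept, see):
    """Return (mean of estimated LPI, SD of the estimates, SD implied for
    actual ratings once error of estimate `see` is added back in).
    Hypothetical helper; assumes homoscedastic, uncorrelated errors."""
    est_mean = slope * mean_score + intercept
    est_sd = abs(slope) * sd_score
    implied_sd = math.sqrt(est_sd ** 2 + see ** 2)
    return est_mean, est_sd, implied_sd

# TOEIC-LC mean 288 and SD 91 for the SP-87 sample (Table 2);
# LC equation and residual SD of .44 for the combined sample (Table 4).
result = estimated_lpi_distribution(288, 91, 0.006067, -0.049348, 0.44)
print([round(x, 2) for x in result])
```

Under these assumptions the estimated mean falls somewhat below LPI
Level 2, consistent with the lower TOEIC means of the general SP
sample relative to the calibration sample.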

An analysis was then made of TOEIC/LPI relationships in the
French, Mexican, and Saudi samples, in the total FMS (non-Japanese)
sample, and in the combined FMS and Japanese samples. This analysis
was designed to assess the consistency of TOEIC/LPI linkage across
samples from several national TOEIC subpopulations.

In addition, using data generated in the ad hoc assessments in 
Mexico and France, respectively, substudies concerned with two areas 
not directly at issue in the study were conducted, namely, (a) a 
substudy of the reliability of the LPI ratings employed and (b) a 
substudy of the predictive value of self-assessments of oral English 
proficiency. The latter substudy was suggested, in part, by the 
findings of Hilton et al. (1985). 

Before focusing directly on the details pertaining to the
foregoing lines of inquiry, it is important to provide a brief
description of the TOEIC and its psychometric properties; also to
elaborate briefly on the nature of the testing program in Japan, and
to describe the role of direct proficiency assessment in the TOEIC
testing program in Japan (that is, to describe factors associated
with the development and maintenance of a cadre of ESL professionals
trained in the LPI procedure, in representative TOEIC-use settings in
Japan).

Characteristics of the TOEIC

The TOEIC is a multiple-choice, norm-referenced ESL proficiency
test that provides measures of English language listening
comprehension and reading abilities, respectively. (For an independent
review and evaluation of the TOEIC, see Perkins [1987: 81-82]; see
Woodford [1982] for developmental detail.)

According to the Guide for TOEIC Users (ETS, 1986a: 1), reading 
items reflect the types of skills involved in comprehending types of 
" . . . materials that people in the business world use, including 
manuals, reports, forms, notices, advertisements, periodicals, and 
memoranda." The listening items are designed to measure understand- 
ing of spoken English in real-life situations. 

Number-right raw scores on the respective sections (listening and
reading), each made up of 100 items, are translated into an
arbitrarily defined standard-score scale with scores ranging from 5 to
495; a total score is derived simply by adding the two scaled section
scores. About two hours of actual testing time are involved.

The original form of the TOEIC was developed (in 1979) using 
items of appropriate difficulty for samples composed predominantly of 
university-educated adult Japanese nationals in or preparing for 
positions requiring the use of English as a second language (Woodford, 
1982). ETS develops three different forms of the test each year. 
These forms are equated, through statistical linkage formulas, to 
assure comparability of scores across successive forms. 

Equating computations are carried out using data for samples of 
Japanese examinees who participate in regularly scheduled Secure 
Program (SP) test administrations (e.g., Angell, Gallagher, and 
Schneider, 1988). Computerized data files for SP administrations 
(offered only in Japan and Korea, as of 1989) are maintained by ETS 
(Princeton) for purposes of test development and analysis. 

Reliability coefficients for the two section scores in these 
equating samples tend to be in the mid-.90s; total score reliability
typically is slightly higher than that for either section. Thus, the 
TOEIC provides a highly reliable basis for assessing individual and 
group differences in acquired English language listening comprehension 
and reading skills. 

Evidence of Concurrent Validity 

The test has face validity as a measure of reading and listening
comprehension in English. Available empirical evidence (e.g.,
Woodford, 1982) suggests that the TOEIC-LC and TOEIC-R scores are
correlated relatively strongly with corresponding LC and Reading
Comprehension & Vocabulary scores on the Test of English as a Foreign
Language (TOEFL), a test that is widely used to assess the English
language skills of foreign ESL students applying for admission to U.S.
and Canadian colleges and universities (see, for example, ETS, 1985a).
However, each of the two tests contains item types not found in the
other.

Table 1 provides information regarding concurrent relationships 
between TOEIC and TOEFL scores in a sample of Japanese test takers. 
Test means were not reported for the TOEIC/TOEFL sample for which 
intercorrelations are shown in the table. Typical reliability coef- 
ficients for the two tests are also shown. 

In a TOEIC-validation study (Woodford, 1982) involving data for a
sample (N = 99) of Japanese examinees from the introductory (1979)
test administration in Japan, TOEIC-LC was relatively highly
correlated (about .80) with concurrent LPI ratings (based on
interviews conducted by native-English-speaking ESL professionals in
Japan, trained especially for the study). Correlations at about this
level were also reported for TOEIC-LC and/or TOEIC-R with concurrent
direct measures of listening, reading, and writing that were developed
ad hoc. The direct listening and reading measures involved, for
example, taped and written English stimuli, with questions and answers
in Japanese.

The results of Woodford's (1982) study indicated a relatively
high level of concurrent correlation between TOEIC scores and all the
direct measures of proficiency, including the Language Proficiency
Interview.

Introduction of the LPI Procedure 
in TOEIC-Use Settings in Japan 

The interviewers/raters who generated the LPI ratings used by
Woodford (1982) were recruited and trained especially for the ad hoc
validity study. To assure the continued availability of a cadre of
trained LPI interviewers/raters in the Japanese TOEIC-testing
context, a TOEIC-ETS staff member conducted in Japan (in 1982) the
first of a continuing series of workshops designed to provide training
for conducting and rating interviews.

The participants in these workshops are native-English-speaking
ESL professionals resident in Japan. They typically are responsible
for conducting continuing, on-site intensive programs of ESL training
sponsored by corporations, or for conducting such programs in
educational institutions. In addition to using the LPI procedure in
their respective employment contexts, from time to time some of these
specialists, by arrangement with the TOEIC Steering Committee in
Japan, provide interview-assessment services under TOEIC auspices for
individuals or groups of individuals. 17

Table 1

Illustrative Intercorrelations and Reliability Data for the
TOEIC and the TOEFL, Respectively, and Concurrent
TOEIC/TOEFL Correlations in a Japanese Sample

                    TOEIC                         TOEFL*
            ---------------------     -----------------------------
            LC      R       Total     LC      S&WE    RC&V    Total

TOEIC
  LC        (.92)   .77     [.94]     .87     .74     .80     NA
  R                 (.93)   [.94]     .78     .85     .87     NA
  Total                     (.96)     NA      NA      NA      NA

TOEFL
  LC                                  (.89)   .67     .68     [illegible]
  S&WE                                        (.86)   .78     [.92]
  RC&V                                                (.90)   [.92]
  Total                                                       (.95)

Note. Entries in parentheses are estimated reliability coefficients in
general examinee samples; entries in [brackets] are part-whole
coefficients.

o The TOEIC intercorrelations and reliability coefficients are as
reported by Woodford (1982) for a sample of Japanese TOEIC
examinees. The TOEIC/TOEFL correlations (upper right portion of the
table) were obtained in a subgroup of examinees from the Woodford
sample, as reported by ETS (1982a). Total-score correlations were
not reported.

o TOEFL intercorrelations and reliability data (lower right
quadrant in the table) are for a general sample of TOEFL examinees
(ETS, 1985a).

*The TOEFL sections are: Listening Comprehension (LC), Structure
and Written Expression (S&WE), Reading Comprehension and Vocabulary
(RC&V).

Members of the cadre of Japan-based ESL professionals involved in
these TOEIC-related LPI workshops generated the TOEIC/LPI data-sets
for the samples of Japanese examinees involved in the present study.

Source of TOEIC/LPI Data for Japanese Examinees 

One data-set consisting of TOEIC scores and LPI ratings for 122
individuals was collected at the initiative of the TOEIC Steering
Committee in 1985. Three additional data-sets (for a total of 163
individuals) were collected during the course of periodic,
comprehensive ESL proficiency assessments involving the joint use of
the TOEIC and interviews that were conducted (in 1984, 1986, and
1987) by the English Department of the Institute for International
Studies and Training (IIST), a graduate-level business school (and a
regular institutional TOEIC subscriber). 18 The TOEIC/LPI calibration
sample thus consisted of what may be referred to as "TOEIC" and
"IIST" subsamples.

The TOEIC subsample. The TOEIC subsample was selected and tested
for the explicit purpose of evaluating concurrent TOEIC/LPI relation- 
ships in samples from a representative array of TOEIC-use contexts 
identified as corporations or other organizations in the Tokyo area 
that regularly use TOEIC services. On-site interviews (taped, and 
rated by at least two individuals) were conducted for previously 
tested TOEIC examinees in a number of corporate or ESL training sites; 
a few individuals were interviewed in the TOEIC office. 

The IIST subsamples. The IIST, among other programs, offers a
nine-month business-oriented training program conducted in Japanese,
supplemented by intensive ESL instruction. The program consists of an
eight-week intensive English course, a 14-week course in Area Studies
and Basic Economics, and another 14-week course in International
Management and Economics. Trainees in the program typically are
selected by a sponsoring company or governmental agency, not by the
IIST. The trainees are predominantly male. An estimated 95 percent are
university graduates; as such they typically have had a core of
academic exposure to the study of English as a foreign language.

The English-language needs of trainees vary: some are scheduled
for overseas assignments or for ESL-essential jobs in Japan after
program completion; ESL needs are less immediate for other trainees.
The TOEIC and the LPI are used jointly so that sponsoring
organizations will have "a recognized norm by which they can measure
the English-language skills of their trainees" (Reilley, March 3,
1988, facsimile communication). TOEIC/LPI data-sets for three groups
of IIST trainees evaluated toward the end of the ESL segment of the
program in 1984, 1986, and 1987, respectively, were made available for
use in the present study.

Analysis of TOEIC/LPI Relationships in Samples
of Japanese Examinees

Means and standard deviations of the study variables are shown 
for the four Japanese samples in Table 2. They are labelled according 
to origin as TOEIC-85, IIST-84, IIST-86, and IIST-87. For perspective, 
comparable statistics are provided for a one-third sample of Japanese 
Secure Program examinees from the September 1987 administration (data
from TOEIC-ETS files).

The average TOEIC scores in the TOEIC/LPI samples were somewhat
higher than those for SP examinees generally: TOEIC-Total means were
618 and 548, respectively.

It is apparent that these numbers do not convey any information 
about how well Japanese examinees are able to function in English. By 
inference, examinees in the TOEIC/LPI sample are likely to have more 
functional ability to use English than those in the general SP sample. 
On the other hand, the fact that the mean LPI rating for the calibra- 
tion sample was approximately at LPI Level 2 conveys some directly 
interpretable information. 

For example:

o Interviewees at LPI Level 2 are able to participate fully in casual
conversations, can speak in extended discourse and express facts,
give instructions, describe, and provide narration about current,
past, and future activities (see the detailed description of this
level in Appendix A).

The four subsamples appear to be similar with respect to overall
patterns of performance on the study variables. This general
impression is confirmed by the results of a multiple discriminant
analysis (MDA), not reported in detail, indicating that the joint
distributions of TOEIC scores and LPI ratings for the groups were not
significantly different. 19

Regression Results 

Table 3 shows intercorrelations of the study variables in the
combined (Japan) sample and in the four subsamples. Selected results
of multiple regression analyses in the respective samples are shown in
the upper right portion of the table (that is, standard partial
regression coefficients [beta weights] and multiple correlation
coefficients that were obtained when LPI ratings were regressed on
TOEIC LC and R scores in the respective samples). Several trends are
noteworthy.

o TOEIC-LC was more closely associated with the LPI criterion than
was TOEIC-R, as hypothesized, except in the IIST-1984 data-set.

o The coefficient for TOEIC-LC was comparable to that for
TOEIC-Total, and only slightly lower than the multiple correlation
coefficient reflecting the best-weighted combination of LC and R. 20

o The Total score (LC plus R) was about as closely related to the
LPI criterion as was the best-weighted combination of LC and R
(compare multiple correlations, shown in the rightmost column of
Table 3, with simple correlations for TOEIC-Total).

Table 2

Summary Statistics for the Calibration Sample(s)

                           TOEIC scores
                    LC           R         Total          LPI*
Sample         N      M   SD      M   SD      M   SD       M     SD

JAPAN        285    316   83    302   77    618  151    1.86   0.67
TOEIC-85     122    311   92    300   78    611  160    1.89   0.72
IIST-84       66    313   77    295   76    608  145    1.78   0.66
IIST-86       55    329   73    310   72    640  137    1.87   0.59
IIST-87       42    315   76    309   83    624  151    1.93   0.61
SP-87**     3558    288   91    260   91    548  172    N.A.    ***

*LPI Level (see Appendix A for detailed descriptions of levels):

5 - Functions equivalent to an educated native speaker.

4 - Able to tailor language to fit audience, counsel, persuade,
negotiate, represent a point of view, etc.

3 - Can converse in formal and informal situations, describe in
detail, offer supported opinions, etc.

2 - Able to fully participate in casual conversations, can express
facts, give instructions, describe, report, and provide narration
about current, past, and future activities.

1 - Can create with the language, ask and answer questions,
participate in short conversations.

0 - No functional ability.

**Randomly selected examinees from the September, 1987 SP
administration (data from TOEIC-ETS files).

***Not estimated.

Table 3

Intercorrelations of Variables in the Calibration Sample(s),
and Results of Multiple Regression Analysis
(Beta Weights and Multiple Correlation Coefficients)

                           Simple correlations      Regression results
                                                    Beta weights
Variable/Sample      N     LC      R    TOTAL   LPI     LC      R    (R)

LC     JAPAN       285     --    .79    (.95)   .75    .55    .26    .77
       TOEIC-85    122     --    .78    (.95)   .79    .58    .26    .80
       IIST-84      66     --    .80    (.95)   .67    .33    .42    .71
       IIST-86      55     --    .81    (.95)   .80    .82   -.01    .80
       IIST-87      42     --    .80    (.94)   .73    .49    .30    .76

R      JAPAN       285     --     --    (.94)   .69
       TOEIC-85    122     --     --    (.94)   .72
       IIST-84      66     --     --    (.95)   .68
       IIST-86      55     --     --    (.95)   .65
       IIST-87      42     --     --    (.95)   .70

TOTAL  JAPAN       285     --     --            .76
       TOEIC-85    122     --     --            .80
       IIST-84      66     --     --            .71
       IIST-86      55     --     --            .76
       IIST-87      42     --     --            .75

Note. Values in the JAPAN rows are for the combined sample;
coefficients in parentheses reflect part-whole correlation. Selected
results of the regression analysis are shown in the upper right
portion of the table: standard partial regression (beta) weights for
LC and R, and the multiple correlation coefficient (R), obtained in
analyses involving data for the respective samples.
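
The combined-sample beta weights and multiple correlation reported in
Table 3 can be checked against the simple correlations using the
standard two-predictor formulas. The following is a minimal sketch,
not the study's computation: the function name is hypothetical, and
the inputs are the rounded combined-sample correlations from Table 3
(LC with LPI, .75; R with LPI, .69; LC with R, .79), so the results
agree with the tabled values (.55, .26, and .77) only to about two
decimal places.

```python
import math

def two_predictor_regression(r_y1, r_y2, r_12):
    """Standardized (beta) weights and multiple R for a criterion regressed
    on two predictors, computed from the three simple correlations."""
    denom = 1.0 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    # R squared equals the sum of each beta times its simple correlation.
    multiple_r = math.sqrt(beta1 * r_y1 + beta2 * r_y2)
    return beta1, beta2, multiple_r

# Rounded combined-sample (JAPAN) correlations from Table 3
beta_lc, beta_r, mult_r = two_predictor_regression(0.75, 0.69, 0.79)
print(round(beta_lc, 2), round(beta_r, 2), round(mult_r, 2))
```

Because LC and R are themselves correlated at about .80, most of the
predictive information in R is already carried by LC, which is why the
multiple correlation exceeds the simple LC correlation only slightly.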

Consistency of LPI-estimation from TOEIC scores. To assess
stability of fit between observed and estimated LPI ratings across the 
four samples, a residual analysis was performed. Using data for the 
combined Japan sample, LPI ratings were regressed on TOEIC-LC, TOEIC- 
LC and TOEIC-R, and TOEIC-Total. The resulting regression equations 
were used to compute three different criterion estimates and the 
corresponding residual values (that is, observed minus estimated LPI 
ratings) for each individual. Means and standard deviations of the 
residual values were then computed for each of the subsamples. 

Results of a one-way analysis of variance (ANOVA) of the resid- 
uals, shown in Table 4, indicate a very close fit between the ob- 
served and estimated values across the four samples, regardless of the 
equation employed. In all instances, the mean difference between 
estimated and observed LPI ratings was less than 0.1 on the 11-point 
LPI scale. The residual standard deviations were comparable to the 
standard error of estimate. 
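
The residual analysis described above can be sketched as follows. This
is an illustration under stated assumptions, not the original
analysis: the function names are hypothetical, the records are
made-up placeholders rather than study data, and only the LC equation
(as reported in Table 4) is applied.

```python
from statistics import mean, stdev

def residuals_by_group(records, slope, intercept):
    """records: iterable of (group, toeic_score, observed_lpi).
    Returns {group: [residuals]}, with residual = observed - estimated."""
    groups = {}
    for group, score, observed in records:
        estimated = slope * score + intercept
        groups.setdefault(group, []).append(observed - estimated)
    return groups

def one_way_f(groups):
    """One-way ANOVA F-ratio for a dict of {group: list of values}."""
    all_values = [v for vals in groups.values() for v in vals]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2
                     for v in groups.values())
    ss_within = sum((x - mean(v)) ** 2
                    for v in groups.values() for x in v)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical records, NOT study data: (subsample, TOEIC-LC, observed LPI)
records = [
    ("A", 250, 1.5), ("A", 300, 2.0), ("A", 400, 2.5),
    ("B", 200, 1.0), ("B", 350, 2.0), ("B", 450, 3.0),
]
# LC equation reported in Table 4: LPI = .006067*LC - .049348
groups = residuals_by_group(records, slope=0.006067, intercept=-0.049348)
for name, res in groups.items():
    print(name, round(mean(res), 2), round(stdev(res), 2))
print("F =", round(one_way_f(groups), 2))
```

A small F-ratio indicates that mean residuals do not differ reliably
across subsamples, that is, that a single equation fits each
subsample about equally well.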

Inferring LPI Performance from TOEIC Scores 

On the basis of the foregoing findings, data for the four sub- 
samples were combined for analyses designed to highlight the degree of 
fit between estimates of LPI based on each of two TOEIC-score
equations, and actual LPI ratings throughout the range of TOEIC scores
represented in the study sample. 

Table 5 shows, for designated TOEIC-Total intervals (upper sec- 
tion) and TOEIC-LC intervals (lower section), (a) the number of ex- 
aminees in the calibration sample, (b) the LPI level expected for 
individuals at the midpoint of each interval, based on the regression 
equation, and (c) the mean and standard deviation of the observed LPI 
ratings for examinees in each interval. The standard error of estim- 
ate in each case was approximately .50 (.45) on the LPI scale (see the 
"Actual S.D." values in the last column of Table 5). 

Fit between actual and estimated LPI means at various TOEIC score
levels is shown in Figure 4a (for TOEIC-Total) and Figure 4b (for
TOEIC-LC). The points plotted in the two figures correspond to the
interval-mean LPI ratings of individuals in the calibration sample.

In each figure, the points conform closely to the line specified by
the regression equation, throughout the range of TOEIC scores
represented in the calibration sample (100 or higher for LC, 200 or
higher for Total). The horizontal lines are spaced at .5 intervals
that correspond, approximately, to the respective standard errors of
estimate.

Table 4

Results of Residual Analysis for the Japanese Calibration
Subsamples Using Several Linkage Equations

                          LPI criterion estimated by
                      LC*             LC & R**           TOTAL***
                   Mean       SD     Mean       SD     Mean       SD
Sample        N   resid.   resid.   resid.   resid.   resid.   resid.

TOTAL       285     .00      .44      .00      .43      .00      .43
TOEIC-85    122     .05      .44      .04      .40      .04      .43
IIST-84      66    -.07      .50     -.06      .47     -.05      .47
IIST-86      55    -.07      .40     -.07      .36     -.07      .38
IIST-87      42     .04      .42      .05      .40      .04      .41

F-Ratio            1.80              1.52              1.29
Probability        0.14              0.21              0.27

  *LPI = (.006067*LC) - .049348                   [R = .75]
 **LPI = (.004401*LC) + (.002266*R) - .208272     [R = .77]
***LPI = (.003376*TOTAL) - .220179                [R = .76]

Table 5

Estimated and Observed LPI Levels Associated with Designated Levels
of Performance on TOEIC Total and TOEIC Listening Comprehension,
Respectively

                                    Mean LPI rating
TOEIC-Total                                               Actual
interval      Midpoint      N     Estimated    Actual     S.D.

200-299          250     (  5)       .62         .90       .42
300-399          350     ( 16)       .96        1.00       .37
400-499          450     ( 41)      1.30        1.38       .42
500-599          550     ( 68)      1.64        1.59       .38
600-699          650     ( 67)      1.97        1.89       .44
700-799          750     ( 48)      2.31        2.29       .49
800-899          850     ( 31)      2.65        2.68       .47
900+             950     (  9)      2.99        3.06       .53

TOEIC-LC
interval

100-149          125     (  4)       .72         .63       .25
150-199          175     ( 18)      1.01        1.19       .39
200-249          225     ( 37)      1.32        1.34       .46
250-299          275     ( 62)      1.62        1.58       .38
300-349          325     ( 72)      1.92        1.85       .45
350-399          375     ( 38)      2.23        2.11       .47
400-449          425     ( 37)      2.53        2.65       .53
450+             475     ( 17)      2.83        2.85       .49

Total sample             (285)      1.86        1.86       .67

Note. Estimated LPI values (LPIest) are based on the following
equations: LPIest (Total) = (.003376*Total) - .220179
           LPIest (LC)    = (.006067*LC) - .049348

Figure 4a. Fit between actual LPI means and means
estimated from TOEIC-Total, by Total-score intervals:
Data for the calibration sample (N = 285)
[Vertical axis: Functional level, 0 (no proficiency) through
5 (equivalent to educated native speaker)]

Figure 4b. Fit between actual LPI means and means
estimated from TOEIC-LC, by LC-score intervals:
Data for the calibration sample (N = 285)
[Vertical axis: Functional level, 0 (no proficiency) through
5 (equivalent to educated native speaker)]

Average LPI expectancy increases directly with TOEIC performance.
For example, an average LPI rating at Level 1 is expected for
examinees with Total scores of about 350, an average rating of
Level 2 for those with Total scores at the 650 level, and an average
rating of Level 3 for those scoring 950. The amount of variability
expected in the LPI ratings at each score level is defined,
statistically, by the standard error of estimate.
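
The examples in the preceding paragraph follow directly from the
linkage equations given in the notes to Tables 4 and 5. The sketch
below simply evaluates those equations; the function names are
hypothetical, and the band of plus or minus one standard error uses
the approximate value of .45 cited in the text rather than an exact
figure.

```python
# Coefficients as reported in the notes to Tables 4 and 5
# (combined Japanese calibration sample, N = 285).
def lpi_from_lc(lc):
    return 0.006067 * lc - 0.049348

def lpi_from_total(total):
    return 0.003376 * total - 0.220179

def lpi_from_lc_and_r(lc, r):
    return 0.004401 * lc + 0.002266 * r - 0.208272

def total_for_level(level):
    """Invert the Total equation: the TOEIC-Total score at which the
    expected LPI rating equals `level`."""
    return (level + 0.220179) / 0.003376

SEE = 0.45  # approximate standard error of estimate cited in the text

for total in (350, 650, 950):
    est = lpi_from_total(total)
    print(total, round(est, 2), (round(est - SEE, 2), round(est + SEE, 2)))
```

Evaluated at Total scores of 350, 650, and 950, the Total equation
gives .96, 1.97, and 2.99, matching the estimated values at those
midpoints in Table 5.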

Tables 6.1 and 6.2 provide more comprehensive information. These
are expectancy tables that show the actual distributions of LPI
ratings (in percent) by TOEIC-score intervals, for TOEIC-Total and
TOEIC-LC, respectively. The distribution of LPI ratings for the total
calibration sample is also shown in the tables. From the tables it may
be inferred, for example, that 90 percent or more of examinees with
Total scores of 900 or higher or LC scores of 450 or higher earned LPI
ratings of at least Level 2+; that the modal LPI rating for examinees
in the 600-695 range (average of about 650) was Level 2; and so on.

Judging from the expectancy tables and previously considered 
findings, it appears that inferences about examinees' LPI performance 
(level of oral English proficiency) are likely to be equally valid, 
whether based on TOEIC-LC or on TOEIC-Total, and that inferences about 
criterion performance from TOEIC-Total, in turn, are comparable sta- 
tistically to those based on complex, regression-weighted composites 
of LC and R. 

This outcome is understandable statistically because TOEIC-LC is 
very highly correlated with TOEIC-Total (r's of about .95, artifactu- 
ally inflated due to part-whole [self] correlation). In addition, the 
LC and R scores themselves are closely related (r's of about .80). The 
outcome is theoretically consistent because the ability measured by 
TOEIC-LC (to comprehend spoken English) is an integral aspect of the 
functional ability (to comprehend and produce utterances in English) 
that is assessed in the Language Proficiency Interview; reading 
skills, on the other hand, are not assessed directly or semi-directly
in the interview situation. 
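
The part-whole inflation mentioned above can be checked with a short calculation. The sketch below is an illustration only, not a procedure from the study: it assumes that Total is the simple sum of LC and R, and it uses the calibration-sample standard deviations reported in Table 7 (83 for LC, 77 for R) together with the LC/R correlation of about .80 cited in the text.

```python
import math


def part_whole(sd_lc, sd_r, r_lc_r):
    """Return (SD of LC + R, correlation of LC with LC + R).

    Assumes Total = LC + R, so that
    cov(LC, Total) = var(LC) + cov(LC, R).
    """
    cov_lc_r = r_lc_r * sd_lc * sd_r
    var_total = sd_lc ** 2 + sd_r ** 2 + 2 * cov_lc_r
    sd_total = math.sqrt(var_total)
    r_lc_total = (sd_lc ** 2 + cov_lc_r) / (sd_lc * sd_total)
    return sd_total, r_lc_total


if __name__ == "__main__":
    sd_total, r_lc_total = part_whole(sd_lc=83, sd_r=77, r_lc_r=0.80)
    print(f"Implied SD of Total: {sd_total:.1f}")
    print(f"Implied LC/Total correlation: {r_lc_total:.3f}")
```

Under these assumptions the implied Total-score standard deviation is close to the reported value of 151, and the implied LC/Total correlation is close to the reported .95, which shows how much of that coefficient follows from LC being a component of Total.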

At the same time, it should not be overlooked that TOEIC-R scores 
were relatively strongly correlated with the LPI criterion in the
calibration sample (r = .69). This means that in the hypothetical absence
of LC scores, very useful inferences about LPI performance could be
drawn from examinees' scores on the TOEIC Reading section only -- an
indirect measure of oral English proficiency. To emphasize this 
point, trends in TOEIC-R/LPI relationships are shown in Table 6.3 and 
Figure 4c. These trends strongly parallel those for LC/LPI rela- 
tionships shown above (in Table 6.2 and Figure 4b). ^2 
Table 6.1

Relationship between TOEIC Total Score and LPI Rating
in English: Japanese Sample

                      Percent with LPI rating
TOEIC
Total         0+     1    1+     2    2+     3    >3  Total (N)

900+                                  33    33    33       ( 9)
800-895                         16    45    29    10       (31)
700-795                   12    38    31    16     2       (48)
600-695              9    22    57     8     4             (67)
500-595             19    46    34     2                   (68)
400-495        7    24    56    10     2                   (41)
300-395       25    50    25                               (16)
200-295       40    40    20                               ( 5)

Total        3.2  13.7  28.1  30.9  13.7   8.1   2.5
(N)            9    39    80    88    39    23     7      (285)

Note. This table is based on data for 285 TOEIC examinees tested
in Japan.

Table 6.2

Relationship between TOEIC Listening Comprehension Score and
LPI Rating in English: Japanese Calibration Sample

                      Percent with LPI rating
TOEIC
LC            0+     1    1+     2    2+     3    >3  Total (N)

450+                             6    41    35    18       (17)
400-445                    3    19    38    30    11       (37)
350-395              3    18    42    29     8             (38)
300-345             11    21    58     6     4             (72)
250-295        2    13    56    26     3                   (62)
200-245        8    35    41    14     3                   (37)
150-195       11    44    39     6                         (18)
< 150         75    25                                     ( 4)

Total        3.2  13.7  28.1  30.9  13.7   8.1   2.5
(N)          (9)  (39)  (80)  (88)  (39)  (23)   (7)      (285)

Note. This table is based on data for 285 TOEIC examinees tested
in Japan.
Table 6.3

Relationship between TOEIC Reading Score and LPI Rating
in English: Japanese Calibration Sample*

                      Percent with LPI rating
TOEIC
Reading       0+     1    1+     2    2+     3    >3  Total (N)

450+                                      100#             ( 1)
400-445                    3    19    29    32    16       (31)
350-395              7     7    33    36    16     2       (58)
300-345              8    27    47    12     5     2       (58)
250-295             15    48    37                         (62)
200-245        7    27    48    18                         (44)
150-195       19    38    25    12     6                   (16)
< 150         33    33    33                               ( 9)

Total        3.8  11.5  26.0  31.8  14.2  10.2   2.6
(N)         (15)  (45) (102) (125)  (56)  (40)  (10)      (393)

* These are joint TOEIC/LPI data for 285 Japanese TOEIC examinees.

# Based on a single case.



Figure 4c. Fit between actual LPI means and means
estimated from TOEIC-R: Data for the calibration
sample (N = 285)
Estimating the Distribution of Criterion Behavior
In General Samples of Japanese Examinees 

It may be recalled (from Table 2) that the TOEIC-Total mean for
the calibration sample was 618 (SD = 151), and the TOEIC-LC mean was
316 (SD = 83), as compared to means of 548 (Total) and 288 (LC) for a
general sample of Secure Program examinees tested in September 1987.
Secure Program (SP) examinees are more highly selected than examinees 
tested in the TOEIC Institutional Program (IP). 

According to information provided by the TOEIC Steering Commit- 
tee in Japan, means for IP examinees generally are approximately 200, 
200, and 400 for LC, Reading, and Total, respectively. As indicated 
earlier, individual score data were not available for IP examinees. 

To provide perspective on the probable distribution of functional
ability to use English in face-to-face conversation in the general
Japanese-examinee population, LPI ratings were estimated from
TOEIC-Total for a sample of examinees from the TOEIC SP administration
conducted in Japan in September 1987; estimates were also made of the
mean LPI rating for the IP population.

The distribution of estimated LPI levels for the SP sample is 
shown in Figure 5. The LPI mean was 1.6 (Level 1+) and the standard 
deviation was .5. These corresponded to the TOEIC-Total mean of 548 
and standard deviation of 172. 

Based on the mean Total score of 400 reported for IP examinees, 
the mean estimated LPI performance for this sample is approximately at 
Level 1 (estimated mean = 1.13). Given a standard error of estimate
of approximately .5, the majority of Japanese IP examinees probably 
are functioning conversationally at or below Level 1+ (1.5). 8 
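
The reasoning in the preceding paragraph can be sketched numerically. The example below is a hedged illustration rather than the study's procedure: it applies the calibration-sample Total-score equation to the reported IP mean of 400 and then, purely as an assumption made for this sketch, treats LPI ratings around that estimate as normally distributed with a standard deviation equal to the standard error of estimate (.5).

```python
import math


def estimated_lpi_from_total(total):
    """Calibration-sample (Japan) equation for TOEIC-Total."""
    return 0.003376 * total - 0.220179


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


IP_MEAN_TOTAL = 400   # approximate IP mean reported by the Steering Committee
SEE = 0.5             # approximate standard error of estimate
LEVEL_1_PLUS = 1.5    # numerical value used in the text for Level 1+

ip_mean_lpi = estimated_lpi_from_total(IP_MEAN_TOTAL)
# Assumption for illustration only: normal spread around the estimate.
share_at_or_below = normal_cdf((LEVEL_1_PLUS - ip_mean_lpi) / SEE)

if __name__ == "__main__":
    print(f"Estimated mean LPI for IP examinees: {ip_mean_lpi:.2f}")
    print(f"Share at or below Level 1+ under the normal assumption: "
          f"{share_at_or_below:.0%}")
```

Under these assumptions, roughly three quarters of examinees scoring at the IP mean would be expected at or below Level 1+, which is consistent with the statement that the majority are functioning at or below that level.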

Taking at face value the findings reflected in Figure 5, and 
considering that these findings do not reflect data for the lower- 
scoring IP examinees, certain general inferences as to the dis- 
tribution of criterion behavior (LPI performance) in the TOEIC 
examinee population in Japan are warranted. For example: 

1. Relatively few Japanese TOEIC examinees (SP and IP combined) are 
likely to be functioning conversationally at or higher than LPI 
Level 3. 

2. The typical level of developed ESL conversational ability in the 
general TOEIC examinee population in Japan (combined IP and SP) is 
about as described for Level 1; the majority probably are function- 
ing approximately between Level 0+ (.5) and Level 2+ (2.5). 
Figure 5. Distribution of estimated levels of oral ESL
proficiency for a sample of Japanese TOEIC Secure 
Program examinees 
[Level descriptions shown with Figure 5. See Appendix A for detailed
descriptions of levels.]

Level 5: Functions equivalent to an educated native speaker.

Level 4: Able to tailor language to fit audience ... on
         all topics pertinent to professional needs.

Level 3: Can converse in formal and informal situations ...
         deal with unfamiliar topics ... offer supported
         opinions.

Level 2: Able to participate fully in casual conversations,
         speak in extended discourse, express facts, give
         instructions, report, and provide narration about
         current, past and future activities.

Level 1: Can create with the language, ask and answer
         questions, participate in short conversations.

Level 0: No functional ability.
Stability of TOEIC/LPI Relationships in Samples from
Diverse TOEIC Testing Contexts 

The findings reviewed above provide evidence regarding the
pattern of concurrent correlations between TOEIC scores and LPI ratings
in several samples from the majority (Japanese) examinee
subpopulation, the average level and range of behaviorally defined LPI
performance that can be expected of Japanese examinees who present
particular TOEIC scores, and the probable distribution of criterion
performance in the TOEIC testing context in Japan.

Questions naturally arise as to whether TOEIC/LPI relationships
are consistent for ESL users/learners likely to be tested with the
TOEIC in other countries. It was possible to conduct analyses of the
consistency of TOEIC/LPI relationships across diverse samples, using
data available for examinees from TOEIC-use settings in France,
Mexico, and Saudi Arabia -- that is, TOEIC/LPI data-sets for samples of
employees in ESL-essential jobs in the Paris office (N = 56) and the
Mexico City office (N = 42) of an international accounting firm, and
Saudi employees (N = 10) of an international petroleum corporation.

Although small, these samples of educated, adult ESL
users/learners were from representative TOEIC-use settings -- places of
work or work-related ESL training that are generally similar in nature
from country to country. Data for general samples of TOEIC examinees
from France, Mexico, and Saudi Arabia were not available for analysis.

TOEIC/LPI Correlations in Diverse Samples 

Table 7 shows the observed pattern of concurrent TOEIC/LPI
correlations for the French (F), Mexican (M), and Saudi (S) samples,
individually, the total FMS sample (N = 108), the combined FMS and
Japanese samples (N = 393), and the Japanese calibration sample
(N = 285). Means and standard deviations are also shown.

From Table 7, it is apparent that the general pattern of
concurrent TOEIC/LPI relationships was similar across all the samples.
Coefficients were somewhat lower in the French sample than in the
Mexican and Saudi samples, consistent with differences in TOEIC-score
variability. Standard deviations were larger in the Saudi sample and
in the Mexican sample than in either the French sample or the Japanese
sample.

The coefficient for TOEIC-LC typically was larger than that for
TOEIC-R, and comparable to the coefficient for TOEIC-Total. In the
FMS sample (N = 108), the coefficient for LC was slightly higher than
that for Total (with the reading component).

In the FMS sample, when LPI was regressed on LC and R treated as
independent predictors (results not shown in Table 7), the resulting
multiple correlation coefficient (.744) was essentially identical to
the simple LC/LPI correlation (r = .7439). In a similar analysis for
the combined FMS and Japanese samples (N = 393), the best-weighted
LC+R composite yielded a multiple correlation coefficient (R = .753)
that was only very slightly higher than the simple LC/LPI and
Total/LPI correlations shown in Table 7 (r = .745 in both instances).



Table 7

Data on Stability of TOEIC/LPI-Criterion Relationships Across
Samples from Different TOEIC-Use Contexts

                        Correlation
                         with LPI          Means and standard deviations

                                          LPI         LC          R        Total
Sample             N    LC    R  Total  Mean   SD  Mean   SD  Mean   SD  Mean   SD

France-87 (F)     56   .62  .58   .65   2.30  .64   428   74   389   48   817  113
Mexico-87 (M)     42   .78  .70   .76   1.71  .62   262  106   237  104   499  204
Saudi-87 (S)      10   .85  .86   .87   1.95  .93   304  107   184  114   489  217
(FMS total)      108   .74  .67   .73   2.04  .71   352  120   311  115   663  229
(Japan total)    285   .75  .69   .76   1.86  .67   316   83   302   77   618  151
Combined         393   .74  .68   .74   1.91  .68   325   96   305   89   630  177

Consistency of LPI Estimation Across Diverse Samples: 

A Residual Analysis 

These correlational findings indicate that within the
several nationally defined samples, and in the two nationally and
linguistically heterogeneous "general" samples -- that is, the total
FMS sample, and the combined study sample -- LPI-criterion performance
varied more closely with TOEIC-LC than with TOEIC-R, and the
coefficient for LC only was comparable to that for TOEIC-Total.
However, evidence of consistency in patterns of concurrent TOEIC/LPI
correlations, alone, does not shed direct light on a question that is
of considerable theoretical as well as practical interest:

Will ESL users/learners (of the type likely to be taking the TOEIC)
who present particular TOEIC scores tend to exhibit about the same
average level of LPI performance, regardless of national-linguistic
origin?

Evidence bearing on this question was obtained by analyzing
differences across the four national samples in mean residuals
associated with three sets of regression equations for estimating LPI,
namely, Set A (estimates from a weighted composite of LC and R), Set B
(estimates from TOEIC-Total), and Set C (estimates from TOEIC-LC only),
each set including equations reflecting data for three different
TOEIC/LPI "calibration samples": the FMS sample (N = 108), the combined
FMS and Japanese samples (N = 393), and the Japanese sample (N = 285).

Nine residual values, one associated with each of the nine 
linkage equations, were computed for each individual in the study 
sample. 27 Mean residuals for the four application samples (the 
French, Mexican, Saudi, and Japanese samples) and the three cali- 
bration samples (FMS, Japan, Combined) are shown in Table 8. Results 
of one-way analysis of variance tests of differences in mean residuals 
associated with the respective linkage equations are also shown. 
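
For a linear estimation equation, the mean residual in a sample equals the sample's actual LPI mean minus the equation evaluated at the sample's mean predictor score. The sketch below uses that identity to approximate the mean residuals for the Japanese-based Total and LC equations from the sample means in Table 7. It is an illustration only: the residual is taken as actual minus estimated (a sign convention inferred here, not stated in the text), the names are hypothetical, and small discrepancies from Table 8 are expected because the published means are rounded.

```python
# Japanese calibration-sample equations (slope, intercept), from the
# note to the earlier table of actual and estimated LPI means.
JAPAN_EQUATIONS = {
    "Total": (0.003376, -0.220179),
    "LC": (0.006067, -0.049348),
}

# Sample means taken from Table 7.
SAMPLE_MEANS = {
    "France": {"LPI": 2.30, "LC": 428, "Total": 817},
    "Mexico": {"LPI": 1.71, "LC": 262, "Total": 499},
    "Saudi": {"LPI": 1.95, "LC": 304, "Total": 489},
}


def mean_residual(sample, predictor):
    """Actual LPI mean minus the LPI mean estimated from the predictor mean."""
    slope, intercept = JAPAN_EQUATIONS[predictor]
    means = SAMPLE_MEANS[sample]
    estimated = slope * means[predictor] + intercept
    return means["LPI"] - estimated


if __name__ == "__main__":
    for sample in SAMPLE_MEANS:
        total_resid = mean_residual(sample, "Total")
        lc_resid = mean_residual(sample, "LC")
        print(f"{sample}: Total-based {total_resid:+.2f}, "
              f"LC-based {lc_resid:+.2f}")
```

The Saudi sample illustrates the pattern discussed in the text: the Total-based mean residual is about .52, whereas the LC-based residual is about .15.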



Figure 6 is a plot of the mean residuals shown in Table 8. By
reference to the figure a general evaluation may be made of trends in
the relative size (in absolute value) of the residuals -- the smaller
the mean residual, the better the fit between average level of
criterion performance and average level estimated from a particular
regression equation. Differences in mean residuals, though
statistically significant in most instances, were comparatively small in
absolute magnitude -- that is, less than |.25| on the LPI scale (0-5).



Mean residuals associated with TOEIC-LC (Set C) equations
typically were smaller than those associated with equations involving
composites of LC and R -- that is, either equations involving
TOEIC-Total (Set B) or equations reflecting regression-weighted
composites of LC and R (Set A).



Except in the case of the Saudi sample, the mean residuals
associated with composite-score equations (Set A or Set B) were
generally similar to the mean residuals associated with LC equations
(Set C). For example, in the Japanese, French, and Mexican
application samples, mean LC-related residuals ranged, in absolute
value, between |.00| and |.25| (the larger means were associated with
Japanese-based calibration equations applied in the French and Mexican
samples), indicating relatively close agreement. However, in the
Saudi sample, mean residuals for Set A and Set B equations, all
influenced by TOEIC-R, were considerably larger in most instances
(ranging up to |.52|).

For the Saudi sample, the TOEIC-LC mean (304) was considerably
greater than the TOEIC-R mean (184), whereas these two means were not
so divergent in the general calibration samples (e.g., 325 versus 305
in the combined FMS and Japanese samples). In the same (Saudi)
sample, the LPI mean (1.95) was consistent with the higher LC mean
rather than with either the low R mean or the R-influenced Total mean.
Although the Saudi sample is small, based on evidence from the TOEFL
testing context (ETS, 1983: 23), there is reason to believe that a
pattern of higher average performance on measures of English-language
listening comprehension than on measures of reading is characteristic
of educated Saudi ESL users/learners. 30
Table 8

Mean Residuals for Study Samples in Analyses Involving LPI as
Estimated from (a) Best-Weighted Composites of LC and R,
(b) TOEIC Total Score, and (c) TOEIC-LC only, Using Equations
Developed in Different "Calibration Samples"

                         Set A                Set B                Set C
                       Wtd LC + R          Total score            LC only
Application        Calibration sample   Calibration sample   Calibration sample
sample        N     FMS  JAPAN  Comb     FMS  JAPAN  Comb     FMS  JAPAN  Comb

France        56   -.08   -.25  -.16     .09   -.23  -.15    -.07   -.24  -.15
Saudi         10    .15    .40   .30     .30    .52   .44     .12    .15   .15
Mexico        42    .07    .23   .17     .05    .25   .18     .07    .18   .14
Japan        285   -.02    .00  -.00     .03    .00  -.01    -.02    .00   .00
Combined     393   -.02   -.00   .00     .05    .01   .00    -.01   -.01   .00
FMS total    108    .00   -.01   .01     .00    .02   .03     .00   -.04  -.01

F-ratio             1.3   12.9   6.0      .9   14.3   7.7     1.0    7.8   3.9
Prob                .27    .00   .00     .03    .00   .00     .38    .00   .01

Note. Underscoring indicates that the mean is expected to be zero
because the estimation equation involved is based on data
for the corresponding calibration sample.



Figure 6. Mean residuals for samples when LPI level is
estimated from (a) best-weighted composites of LC and R,
(b) TOEIC-Total, and (c) TOEIC-LC only, using equations
based on FMS data, total-sample data, and Japanese data

[Axis label: Mean residuals (on the LPI scale). Legend, equation
base: FMS data, all data, Japan data.]
It is noteworthy that in a sample with quite divergent means on
TOEIC-LC and TOEIC-R, indicative of differential relative levels of
development of the corresponding English-language skills, the average
level of LPI-assessed oral English proficiency was indexed more
accurately by average level of developed listening comprehension than by
average level of developed reading ability. This is especially
interesting in view of the fact that the several measures were highly
intercorrelated in the sample -- TOEIC/LPI correlations were in the
mid-.80's, for example.

Figure 7a shows the regression of LPI rating on TOEIC-LC in the
combined sample (N = 393); trends in LPI means by TOEIC-LC interval
are shown for Japanese examinees (broken line) and for FMS examinees
(dotted line). Figure 7b shows comparable trends involving
TOEIC-Total. As expected from results of the residual analysis, LPI
performance in both samples conformed more consistently to expectancy
based on TOEIC-LC than to expectancy based on TOEIC-Total.

A summary of evidence regarding the relationship between TOEIC-LC 
and LPI ratings in the combined sample is provided in Figure 8 and 
Table 9. These displays are comparable to those shown earlier for the 
Japanese calibration sample (see, for example, Figure 4b and Table
6.2, above).
Figure 7a. Regression of LPI rating on TOEIC-LC
in the combined sample (N = 393), with plot of
actual mean rating by LC-score interval for
the TOEIC-Japan and TOEIC-FMS samples

[Axis label: TOEIC-LC]



Figure 7b. Regression of LPI rating on TOEIC-Total 
in the combined sample (N = 393), with plot of actual 
mean rating by Total-score interval for the 
TOEIC-Japan and TOEIC-FMS samples 
Figure 8. Plot of mean LPI ratings by TOEIC-LC
intervals: Total sample (N = 393)

[Axis label: TOEIC-LC]



Table 9

Relationship between TOEIC Listening Comprehension Score
and LPI Rating in English: Combined Sample*

                      Percent with LPI rating
TOEIC-LC
interval      0+     1    1+     2    2+     3    >3  Total (N)

450+                            17    30    40    13       (47)
400-445                    9    21    36    26     8       (53)
350-395              4    18    45    26     8             (51)
300-345              9    20    61     6     3             (88)
250-295        3    11    55    26     6                   (73)
200-245        7    33    42    16     2                   (45)
150-195       12    40    36    12                         (25)
< 150         64    18    18                               (11)

Total        3.8  11.5  26.0  31.8  14.2  10.2   2.6
(N)         (15)  (45) (102) (125)  (56)  (40)  (10)      (393)

* These are joint TOEIC/LPI data for 393 TOEIC examinees tested
in Japan (N = 285), France (N = 56), Mexico (N = 42) and Saudi
Arabia (N = 10).
Section IV: TOEIC/LPI RELATIONSHIPS -- FINDINGS, CONCLUSIONS,
AND SUGGESTED DIRECTIONS FOR FURTHER INQUIRY

This study was undertaken to develop and evaluate regression-based
guidelines for making inferences from (a) scores on the TOEIC, about
(b) level of ability to use English in face-to-face conversation
(indexed by performance in Language Proficiency Interviews), for (c)
examinees in samples of ESL users/learners from the TOEIC testing
context, using (d) data generated during the course of operational ESL
assessments involving the joint use of TOEIC scores and the LPI
procedure in diverse TOEIC-use settings. The findings reflect actual
experience in representative test-use settings in Japan, France (F),
Mexico (M), and Saudi Arabia (S).



Overview and Evaluation of Findings



Performance in the Language Proficiency Interview was strongly 
and consistently associated with TOEIC performance not only in the 
comparatively large TOEIC/LPI-calibration sample from the majority
(Japanese) TOEIC test-taking subpopulation, but also in samples of 
examinees from three diverse, national TOEIC subpopulations. 

Trends in TOEIC/LPI relationships observed in these samples re- 
flect patterns of association that plausibly can be expected to hold 
in similar samples from the corresponding national TOEIC subpopula- 
tions and, by inference, in similar samples from the larger TOEIC 
testing context. Study findings are reviewed and evaluated in some 
detail, below, to highlight the evidentiary and theoretical foundation 
for this assertion, and for related conclusions and interpretive gen- 
eralizations about the findings. 

Consistent Pattern of Concurrent TOEIC/LPI Correlation 

There was a consistent pattern of concurrent correlation between
TOEIC scores (LC, R, and Total) and level of functional ability to
use English in face-to-face conversation (LPI performance) in the
study sample. The pattern was essentially as described below:

1. TOEIC-LC/LPI correlations (typically in the mid-.70's) were
somewhat higher than TOEIC-R/LPI correlations (typically about .70).

2. Simple correlations between the Total score (with the reading
component) and the LPI criterion were about the same as the
LC/LPI coefficients -- very slightly lower in some instances.

3. When LPI performance was regressed on LC and R (treated as a
battery of predictors) in the Japanese sample, the FMS [total
non-Japanese] sample, and the combined FMS and Japanese samples,
respectively, the resulting multiple correlations (uncorrected for
shrinkage) were only very modestly larger than the simple
correlation for TOEIC-Total or TOEIC-LC only.
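
To give a sense of how little shrinkage matters at these sample sizes, the sketch below applies the familiar adjusted R-squared formula to the combined-sample multiple correlation reported earlier (R = .753, N = 393, two predictors). This adjustment was not reported in the study; it is included here only as an illustration, and the function name is hypothetical.

```python
import math


def adjusted_multiple_r(r, n, k):
    """Multiple correlation adjusted for shrinkage.

    Uses the standard adjusted R-squared formula:
    1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where n is the sample size and k the number of predictors.
    """
    r2_adjusted = 1.0 - (1.0 - r ** 2) * (n - 1) / (n - k - 1)
    return math.sqrt(max(r2_adjusted, 0.0))


if __name__ == "__main__":
    r_adj = adjusted_multiple_r(r=0.753, n=393, k=2)
    print(f"Adjusted multiple correlation: {r_adj:.3f}")
```

With these inputs the adjusted value is about .752, so the comparison with the simple LC/LPI and Total/LPI correlations (r = .745) is essentially unchanged.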
Functional Linkage Suggested Between Listening Comprehension
and Oral Language Proficiency

Viewed from a theoretical perspective, evidence of consistently
higher criterion-related validity for TOEIC-LC than for TOEIC-R, and
lack of improvement in prediction when the R score is added to the LC
score, suggests a strong underlying functional linkage between the
ability measured by TOEIC-LC (to comprehend utterances in English) and
the more complex ability assessed in the LPI situation (to comprehend
and produce utterances in English).

Even though TOEIC-R is substantially correlated with the
LPI criterion, it does not appear to be measuring criterion-related
abilities that are different from those being measured by TOEIC-LC,
with which TOEIC-R is relatively highly correlated (coefficients in
the mid-.70's). This is a theoretically consistent finding: ability
to comprehend spoken English is an integral aspect of the functional
ability assessed in the face-to-face interview; this is not true of
reading ability.

The pattern of correlational findings suggests that examinees 
with relatively high (low) average levels of TOEIC-assessed ability to
comprehend spoken English may be expected to perform relatively well 
(poorly) in the interview situation, on the average, regardless of 
their average level of reading ability. 

Results of the residual analysis reinforce this proposition. 

o When LPI was estimated from LC only, based on a general
regression developed using combined data for the Japanese, French,
Mexican, and Saudi samples, mean residuals for the several application
samples were comparably small (none was greater than |.15| in
absolute value on the 0-5 LPI scale).

o When LPI was estimated from a comparably derived Total-score
equation (with the R component), the mean residual for the Saudi
sample was noticeably larger (|.44|) than mean residuals for the
other samples (none greater than |.18| in absolute value).

o In the Saudi sample only, the TOEIC-LC mean was rather markedly
higher than the TOEIC-R mean.

Thus, in a sample with divergent LC and R means (indicating
differential relative levels of development of the corresponding
English-language macroskills), the actual average level of criterion
performance conformed to expectation based on the average level of
TOEIC-assessed LC rather than expectation based on the average
TOEIC-Total score (and, by inference, the much lower TOEIC-R mean).
This occurred despite the fact that the within-sample TOEIC/LPI
intercorrelations were very strong (all coefficients were in the
mid-.80's).
High correlations between various language skills suggest that
"different aspects of language tend to be learned together . . . and
advancement in any aspect of language is generally accompanied by
advancement in other aspects" (Carroll, 1983: 94). However, in
particular subpopulations, the rate and course of development in one
aspect of second-language proficiency may differ from that in other
aspects of proficiency, as illustrated by the markedly different
TOEIC-LC and TOEIC-R means for the Saudi examinees.

Consistent Evidence Pointing to Functional LC/LPI Linkage 

On balance, the evidence that has been reviewed suggests strongly 
that the ability to comprehend and produce utterances in English is to 
some extent "dependent," directly and functionally, upon the ability 
to comprehend spoken English. Accordingly, it follows logically that 
level of ability to use English in face-to-face conversation (indexed 
by LPI performance) is likely to vary relatively consistently with 
level of developed English-language listening comprehension (indexed 
by TOEIC-LC), across as well as within samples of ESL users/learners
from diverse TOEIC subpopulations such as those represented in the 
present study. 

Although the relationship between reading ability and LPI per- 
formance is relatively strong, it derives indirectly from criterion- 
related variance that is common to both the reading measure and the 
(functionally pertinent) listening comprehension measure. Even though 
performance on a measure of listening comprehension is likely to be 
relatively closely related to performance on a measure of reading 
ability in samples from diverse national subpopulations, it does not 
necessarily follow that the corresponding English-language macroskills 
are equally highly developed in the subpopulations involved. 

This suggests the "distinctness of listening and reading as
traits," as concluded by Bachman and Palmer (1983) based on results of
a factor study involving 10 ESL proficiency measures, 5 of reading and
5 of speaking skills (including the LPI, administered by individuals
trained for the ad hoc study). At the same time, measures of
language macroskills are relatively strongly intercorrelated in samples
of educated ESL users/learners. This is indicated by results of the
present study, results reported by Bachman and Palmer (1983), and
general research findings (see Hale, 1986, for a summary of research
in the TOEFL testing context; see also Pike, 1979; Oller, 1983,
passim).



Inferring LPI Performance from TOEIC-LC in the Larger 
TOEIC Testing Context: Conclusions 

The evidence adduced in this study supports the following 
conclusion (thought of as a strong working hypothesis): 
the level of LPI performance associated with particular levels of
performance on TOEIC Listening Comprehension is likely to be
relatively consistent across diverse samples from national
subpopulations characterized by differential average levels of
development of TOEIC-assessed English-language listening comprehension
and reading skills.

This is a necessary condition for establishing meaningful gen- 
eral, as opposed to "subpopulation specific," guidelines for inter- 
pretive inferences about performance on a criterion measure from 
predictor score(s).

Based on the evidence and lines of reasoning developed above, the 
regression results for the combined sample of Japanese, French, Mexi- 
can, and Saudi examinees constitute guidelines that have interpretive 
relevance for test users in the larger TOEIC context, certainly for
general estimation purposes. 

Figure 9, for example, provides information regarding the per- 
centage of examinees by TOEIC-LC intervals that may be expected to 
earn designated LPI ratings. 

o The data suggest that a substantial majority of examinees with
TOEIC-LC scores of 400 or better will tend to be at LPI Level 2+
or higher, that a comparable majority of those with scores between
300 and 400 will tend to be at or above LPI Level 2, and so on.
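
The "at or above" percentages behind statements of this kind can be derived by cumulating the expectancy-table rows. The sketch below does so for the combined-sample percentages in Table 9; the placement of blank cells follows the table layout as read here, and the function and variable names are illustrative rather than part of the study.

```python
LEVELS = ["0+", "1", "1+", "2", "2+", "3", ">3"]

# Percent of examinees at each LPI rating, by TOEIC-LC interval (Table 9).
# Blank cells in the table are entered as 0.
TABLE_9 = {
    "450+":    [0, 0, 0, 17, 30, 40, 13],
    "400-445": [0, 0, 9, 21, 36, 26, 8],
    "350-395": [0, 4, 18, 45, 26, 8, 0],
    "300-345": [0, 9, 20, 61, 6, 3, 0],
    "250-295": [3, 11, 55, 26, 6, 0, 0],
    "200-245": [7, 33, 42, 16, 2, 0, 0],
    "150-195": [12, 40, 36, 12, 0, 0, 0],
    "<150":    [64, 18, 18, 0, 0, 0, 0],
}


def percent_at_or_above(interval, level):
    """Percent of examinees in the interval rated at `level` or higher."""
    start = LEVELS.index(level)
    return sum(TABLE_9[interval][start:])


if __name__ == "__main__":
    for interval in ("450+", "400-445"):
        print(interval, "at or above 2+:",
              percent_at_or_above(interval, "2+"))
    for interval in ("350-395", "300-345"):
        print(interval, "at or above 2:",
              percent_at_or_above(interval, "2"))
```

Because the published row percentages are rounded, row totals may differ from 100 by a point, and the cumulated figures should be read as approximate.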
The trends highlighted in Figure 9 constitute guidelines for
making inferences from examinees' TOEIC-LC performance about their
probable level of LPI performance. 34 By inference, they will be able
to use English in face-to-face conversation as outlined in the
detailed behavioral descriptions for the corresponding oral language
proficiency levels that are provided in Appendix A.

Perspective on the Distribution of LPI-Assessed
Oral English Proficiency for TOEIC Examinees 

Because TOEIC score data are available for general samples of 
Japanese examinees, it was possible, using regression equations 
developed in the calibration sample, to estimate the distribution of 
oral English proficiency according to the behaviorally defined LPI 
scale for the subpopulation of Japanese TOEIC examinees. Equally 
comprehensive TOEIC score data are not yet available for general 
examinee subpopulations in France, Mexico, Saudi Arabia, and most 
other current TOEIC subpopulations. 

However, useful general normative inferences may be drawn from 
Figure 10, which shows relative frequency distributions (in polygon 
form) of LPI rating for the Japanese and the FMS (non-Japanese) TOEIC/ 
LPI-calibration samples, and for a general sample of Japanese TOEIC 
examinees. 



Figure 10. Distributions of LPI ratings in English for 
TOEIC samples, and of LPI ratings in French or Spanish 
for samples of teachers of these languages and of college 
seniors specializing in these languages in the U.S. 

Functional level: 0 (no proficiency) through 
5 (equivalent to educated native speaker) 

Legend: Estimated distribution for SP examinees, from TOEIC-LC score. 

In evaluating these distributions it is useful to recall that the 
FMS distribution reflects data for a sample with a TOEIC-LC mean 
toward the upper end of the "5-495" standard-score scale (428 for the 
French sample), and a sample with a comparatively low LC mean (262 for 
the Mexican sample). As additional score-data become available, it 
will be possible to make more precise estimates of LPI distributions 
for the general TOEIC testing context. 

For general interpretive perspective, relative frequency distrib- 
utions of LPI ratings in French or Spanish are shown for samples of 
secondary-school teachers of these languages in the U.S. (histogram of 
average ratings from Hilton et al., 1985), and U.S. college seniors 
specializing in the study of these languages (a relative frequency 
polygon adapted from estimates by Carroll, 1967). 

It is assumed, for working purposes, that LPI ratings are gener- 
ally comparable across target languages. By combining information 
from Figure 9 and Figure 10, it is possible to draw interpretive in- 
ferences regarding (a) the nature of the distribution of LPI-assessed 
ability to use English in face-to-face conversation in the general 
TOEIC testing context, and (b) the functional proficiency of TOEIC 
examinees at certain score levels relative to that of defined sam- 
ples of second-language specialists--that is, language students and 
language teachers. For example: 

o Most of the ESL users/learners likely to be tested with the TOEIC 
did not specialize in ESL during their educational careers. Rela- 
tively few of them are likely to earn ratings much beyond Level 3. 

However, this appears to hold as well for populations made up 
predominately of nonnative speakers of two target languages who 
specialized in the study of those languages--samples of U.S. student 
specialists and language teachers who, by inference, are relatively 
highly selected on proficiency-related variables. 

o Most of the foreign language teachers demonstrated LPI-assessed 
oral language proficiency judged to be at or above Level 2--largely 
between Level 2 and Level 3, inclusive. Figure 9 shows that a 
majority of examinees with TOEIC Listening Comprehension scores of 
300 or higher are expected to demonstrate proficiency rated at Level 
2 or higher--representing the attainment of "... a highly usable 
set of skills" (ETS, 1982b: 131). 

It follows that the distribution of LPI-assessed oral language 
proficiency for TOEIC examinees with LC scores of 300 or above is 
comparable to that for the focal population of "foreign language 
teachers." 

The following comments by Lowe (1987: 45, emphasis added) are 
useful in light of evidence that relatively few academically trained 
ESL users/learners in the TOEIC testing population are likely to be 
rated much above LPI Level 3. Lowe provides perspective regarding the 
interpretation of interview performance rated at Level 3, as well as 
higher levels on the LPI(ILR) scale. 

The ILR scale is developmental in nature. At the summit the scale 
refers to the proficiency of an educated native speaker (ENS). 
This does not imply that all natives are at Level 5. ENS status is 
normally acquired through long-term familiarization (from infancy to 
university graduate school) with varying kinds of language and 
social groups over a wide number of concrete and abstract subject 
areas. Although most individuals at Level 5 possess a diploma, ENS 
status is proven by the examinee's ability to use the language. ILR 
experience shows that the majority of native speakers of English 
probably fall at Level 3. In ILR experience, the number of 
nonnative Level 5's is miniscule. 

Comments by Carroll (1983: 102-103) about the spoken language 
skills of native speakers provide additional perspective on the 
difference between the "native-like" functional ability to exchange 
meaning in English that Lowe associates with LPI Level 3 and the 
equally "native-like" abilities associated with higher levels on the 
LPI scale. 

(S)tudy of the spoken language skills of native speakers may appear 
to be rather supererogatory, because at least at adult levels, 
native speakers have almost by definition acquired to a high degree 
the communicative skills that second language learners seek to 
acquire. Even young children have acquired many of these skills. 
Native speakers do not make the 'errors' in phonology, lexicon, and 
grammar that nonnatives make, even those who are fairly well 
advanced. If native speakers make errors in tests of 'grammar,' 
these tests often turn out to be tests of formalistic conventions 
associated with certain aspects of 'educated' speech and writing 
styles . . . ; they represent advanced phases of language development 
that go beyond the normal acquisition of a second language (emphasis 
added). 

By inference, ESL users/learners who perform at LPI Level 3 
(attainable, according to Lowe, by a majority of native English 
speakers), have acquired native-like "communicative" skills--but not 
levels of advanced, "educated" English proficiency with respect to 
which native speakers themselves differ markedly (as evidenced, for 
example, by differences in performance on tests of "verbal ability" 
used in college admission). 



Directions for Further Research on 
TOEIC/LPI Relationships 

For the Japanese-examinee subpopulation, the strength and con- 
sistency of TOEIC/LPI relationships have been amply documented--in the 
present study, Woodford's (1982) validation study, and studies con- 
ducted by Japanese scholars specializing in English-language instruc- 
tion and assessment (e.g., Saegusa, 1985). TOEIC/LPI relationships 
are similarly strong in initial samples from three developing TOEIC 
subpopulations. 

One can conclude, as a strong working hypothesis, that the 
pattern of TOEIC/LPI (predictor/criterion) relationships observed in 
these samples will be consistent across similarly selected samples of 
ESL users/learners in major national subpopulations in the larger 
TOEIC testing context. There is theoretical as well as evidentiary 
support for so concluding. However, evidence regarding TOEIC/LPI 
relationships in samples from other developed and developing TOEIC 
subpopulations is needed to permit a comprehensive empirical evalu- 
ation of this working hypothesis. TOEIC/LPI data-sets for samples 
from additional countries probably will be generated naturalistic- 
ally, as the data-sets employed in the present study were generated, 
during the course of operational assessments in TOEIC-use settings. 

As additional TOEIC/LPI data-sets become available from diverse 
settings, it will be possible to obtain empirical answers to the two 
questions that appear to be most pertinent: 

1. Is the pattern of TOEIC/LPI concurrent correlations consistent 
with that observed in the samples under consideration in the present 
study? 

2. Is the average level of criterion performance consistent with 
expectation based on the average level of performance on the TOEIC 
(as specified by guidelines developed in the general TOEIC/LPI- 
calibration sample available for the present study)? 
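
The two questions above lend themselves to a simple computational check once a new TOEIC/LPI data-set is in hand. The sketch below is illustrative only: the function names are hypothetical, the intercept and slope must be supplied from the calibration-sample regression equation (they are not given in this section), and the example values are invented.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def check_new_sample(lc_scores, lpi_ratings, intercept, slope):
    """Summarize a new sample against a calibration-sample regression line.

    Question 1: the concurrent correlation r, to compare with earlier samples.
    Question 2: observed LPI mean versus the mean expected from the TOEIC-LC mean.
    """
    observed_mean = mean(lpi_ratings)
    expected_mean = intercept + slope * mean(lc_scores)
    return {
        "r": pearson_r(lc_scores, lpi_ratings),
        "observed_lpi_mean": observed_mean,
        "expected_lpi_mean": expected_mean,
        "mean_difference": observed_mean - expected_mean,
    }


# Invented values; the intercept and slope here are placeholders, not the study's.
example = check_new_sample([200, 300, 400], [1.0, 2.0, 3.0], intercept=-1.0, slope=0.01)
```

A correlation close to those reported for the calibration samples, together with a mean difference near zero, would be consistent with the working hypothesis; what counts as "close" is a judgment the sketch does not make.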



Potential Usefulness of Self-Ratings of 
Oral English Proficiency 

There is reason to believe that self-ratings of oral English 
proficiency may be a useful surrogate for actual interview ratings in 
research designed to assess consistency of patterns of relationships 
between TOEIC scores and level of oral English proficiency across di- 
verse national subpopulations and to identify nontest variables that 
may contribute to the prediction of LPI performance, after controlling 
for TOEIC scores. 

For example, there is evidence suggesting that "adult learners 
can assess their speaking and listening skills in much the same way as 
do their teachers" (Ingram, 1985: 268). Hilton et al. (1985) reported 
relatively high correlations between self-ratings of speaking profic- 
iency (in Spanish and French) and the corresponding LPI ratings--cor- 
relations were r = .66 and r = .69 in samples of French and Spanish 
teachers in the United States. 

Results of a self-assessment substudy. More direct evidence of 
the potential usefulness of self-ratings of oral English proficiency 
as a surrogate criterion (for purposes of research) in the TOEIC 
testing context is provided by results of a special self-assessment 
substudy based on data collected by TOEIC/ETS staff in the sample of 
French examinees (see Table 7 and related discussion, above). Self- 
ratings of oral English proficiency were obtained using a rating scale 
with LPI-parallel level-descriptions (see Appendix B, Exhibit B.2). 

An analysis was made of interrelationships among TOEIC scores, 
self-ratings, and LPI ratings. Selected findings are shown in Table 
10. 



Table 10 

Selected Findings of the Self-Assessment Substudy 
in the French Sample (N = 56) 

                  Correlation with 
                Self-     Interview 
Variable        rating    rating       Mean     SD 

TOEIC-LC        .640      .616         428      74 
TOEIC-R         .501      .583         389      48 
TOEIC-Total     .628      .646         817     113 
Self rating      --       .643        2.77     .93 
LPI rating      .643       --         2.30     .64 
Pred LPI.lc*    .640      .616        2.46     .39 

*LPI estimated from TOEIC-LC using the combined-sample regres- 
sion equation (see Table 8, above, and related discussion). 
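
Because the predicted LPI values in Table 10 are a linear function of TOEIC-LC, the means and standard deviations in the table imply an approximate regression line. The sketch below back-calculates it as a worked illustration. It is not the equation reported in Table 8 (which is not reproduced in this section); it assumes a positive slope, and because it starts from rounded table values the result is approximate. The function names are hypothetical.

```python
def line_from_moments(x_mean, x_sd, pred_mean, pred_sd):
    """Recover (intercept, slope) of pred = intercept + slope * x.

    For a linear transformation with positive slope:
        SD(pred)   = slope * SD(x)
        mean(pred) = intercept + slope * mean(x)
    """
    slope = pred_sd / x_sd
    intercept = pred_mean - slope * x_mean
    return intercept, slope


# Rounded values from Table 10 (French sample): TOEIC-LC mean 428, SD 74;
# predicted LPI mean 2.46, SD .39.
intercept, slope = line_from_moments(428, 74, 2.46, 0.39)


def approx_predicted_lpi(toeic_lc):
    """Approximate LPI estimate for a TOEIC-LC score, using the back-calculated line."""
    return intercept + slope * toeic_lc
```

By construction, the line reproduces the tabled predicted mean (2.46) at the tabled TOEIC-LC mean (428); estimates at other score levels inherit the rounding error of the inputs and should be read as rough indications only.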



For present purposes, the most pertinent aspect of these findings 
is that the general level and pattern of relationships between TOEIC 
scores and self-rated oral English proficiency was quite similar to 
that between TOEIC scores and the actual LPI-criterion measure (other 
aspects of these findings are discussed in Appendix B). 

o Assuming that oral English proficiency is more closely related to 
TOEIC-LC than to TOEIC-R, the results obtained using self-ratings as 
the criterion and those obtained using actual LPI ratings (the 
surrogated criterion) lead to the same conclusion. 

Self-ratings of oral English proficiency (and other aspects of 
proficiency such as writing or reading) are obtainable on a routine 
basis as part of the regular test administration process, along with 
the responses of examinees to pertinent background questions (sex, 
age, educational level, extent of use of English on-the-job, job cat- 
egories, type of employer, time spent in an English-speaking environ- 
ment, and so on). Self-ratings could be used as a surrogate cri- 
terion in studies designed to identify demographic, experiential, or 
other variables that contribute to the prediction of self-assessed 
oral English proficiency. The results of such studies should provide 
a basis for assessing and formulating working hypotheses regarding the 
nontest correlates of LPI performance. 

Further exploration of the usefulness of self-assessments of oral 
English proficiency, using a rating scale with LPI-parallel level- 
descriptions, is warranted (as is attention to the development of pro- 
cedures for collecting data on potentially relevant personal, demo- 
graphic, and experience variables in all major testing contexts-- 
along lines now well-established in the Japanese testing context). 



Other Directions for Future TOEIC Research 

This study was not designed to obtain evidence regarding the 
relationship of TOEIC scores to directly assessed measures of English- 
language reading or writing skills in samples of TOEIC examinees. 
Woodford (1982) provided evidence that TOEIC scores were closely 
related to direct measures of both of these skills, as well as to LPI 
performance. However, Woodford did not rate the samples of reading 
and writing ability according to behaviorally defined levels that par- 
alleled those of the LPI oral language proficiency scale. 

Do Equal TOEIC-LC and TOEIC-R Scores Reflect 
"Comparable Levels of Proficiency?" 



In his study of the attainments of foreign language majors in the 
U.S., based on estimated functional levels of oral language profic- 
iency and reading ability, Carroll (1967) concluded that foreign lan- 
guage majors were generally less advanced in listening and speaking 
skills than in reading and writing skills. 

Typically, the 'regular' cases had mean scores in Listening and 
Speaking that correspond to FSI ratings of S-2 or S-2+, i.e., in the 
range of 'limited working proficiency.' In Reading and Writing, 
however, the tested students tended to have mean scores that cor- 
respond approximately to an FSI rating of R-3 . . . (p. 199) 

Moreover, based on evidence introduced in Section II (see especi- 
ally Figures 2a, 2b, & 2c), educated ESL users/learners in the TOEFL 
testing context tend to be considerably more advanced, on the average, 
in reading ability than in the ability to comprehend spoken English 
or, by inference, in the ability to use English conversationally. 

Judging from the foregoing, the academically trained ESL users/ 
learners in the TOEIC testing context may tend to be more native-like, 
on the average, in their functional ability to read (and possibly to 
write) English, than they are in their ability to comprehend and 
produce utterances in English. To the extent that this is true, 
standard scores on TOEIC-LC and TOEIC-R (scores representing equal 
deviations from standardization-sample raw-score means on the 
respective measures) may represent different levels of functional 
ability. It is important to obtain evidence bearing on this issue. 

The question of differences in level of skill-development could 
be addressed by obtaining evidence regarding the comparative perform- 
ance of native-English-speaking counterparts of, say, Japanese test- 
takers, on the LC and R sections of the TOEIC--for example, in coop- 
erative studies designed to obtain TOEIC-score distributions for 
native-English-speaking employees and Japanese employees in comparable 
positions (sales, engineering, and so on) with particular companies. 

This question might also be addressed by using procedures for the 
direct assessment of reading (and writing) skills. Such procedures do 
not appear to have been widely used in operational testing settings. 
Carroll's (1967) use of the LPI-parallel procedure for rating reading 
proficiency may represent a unique application in the context of a 
large-scale assessment of functional levels of second-language skills 
in general populations of second-language users/learners. 

Assessment of developmental levels of reading and writing skills 
entails problems of behavior sampling that are not present to the same 
extent in assessing LPI performance. Detailed examination of problems 
associated with the application of LPI-parallel procedures for the 
direct assessment of reading and writing skills is beyond the scope of 
this paper. However, it is pertinent to note that the controlled con- 
versational interview is particularly useful as a basis for eliciting 
and rating second-language proficiency. This is so, in part, because 
it is an interactive assessment procedure that allows the interviewer 
to elicit and evaluate behavior in any area (e.g., functioning, regis- 
ter) deemed to be relevant for establishing an examinee's functional 
command of a target language. On the other hand, in assessing writing 
ability, for example, it is inherently more difficult to obtain a 
correspondingly representative sampling of pertinent behavior. 

It is possible that self-ratings of reading and writing ability 
according to a schedule with LPI-parallel level-descriptions might 
prove useful for research purposes in the TOEIC testing context. In 
the case of writing ability, graded samples of general business cor- 
respondence might be used as part of the self-assessment process. 



Need to Translate General Interpretive Guidelines 
into Context-Specific Interpretive Guidelines 

The chain of interpretive inference that has been validated in 
this study is definitionally limited. The evidence explicitly links 
TOEIC scores, directly, to one very clearly defined and generally 
important aspect of developed ability to use English as educated 
native speakers can be expected to use the language, namely, the 
ability to use English in face-to-face conversation as reflected in 
performance in Language Proficiency Interviews conducted under 
controlled conditions by trained interviewers/raters. 

Setting Local Interpretive Guidelines 

Ultimately the information conveyed by TOEIC scores and LPI rat- 
ings needs to be linked (formally-statistically and/or clinically- 
intuitively) to defined criteria of the ability of ESL users/learners 
to use English in the workplace. This would entail evaluative judg- 
ments regarding the adequacy or relative adequacy of the performance 
of employees in specific ESL-dependent-positions--that is, positions 
in which successful job performance is dependent to some extent upon 
ability to use English. 

Questions at issue in the workplace generally have to do with 
establishing the implications of test performance for ESL-profici- 
ency-related selection, placement, training, and job-classification 
decisions. Such decisions must be made within constraints imposed by 
a finite pool of employees or prospective employees with a particular 
joint distribution of English language skills, the amount of time and 
resources available for training designed to improve skill levels in 
the pool, and other practical considerations. 



The process of developing meaningful local (context-specific) in- 
terpretive guidelines needs to be guided by 

1. a realistic assessment of the extent to which incumbents in ESL- 
dependent-positions are meeting the assignments associated with those 
positions in a manner that is considered satisfactory by general 
company standards, 

2. the assumption that different jobs require different levels and 
patterns of proficiency in English, and 

3. the premise that decisions regarding test-based minimum pro- 
ficiency requirements should take into account the actual score 
distributions of employees whose overall on-the-job performance is 
judged, by usual company standards, to be at least minimally satis- 
factory--minimum proficiency requirements for getting the job done are 
likely to vary depending upon the job. 
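
As a rough illustration of the third point, the score distributions of satisfactorily performing incumbents can be summarized job by job before any minimum requirement is discussed. The sketch below is hypothetical throughout: the record layout, the job labels, and the choice of summary statistics (count, lowest score, median) are assumptions made here for illustration, not recommendations from the study.

```python
from collections import defaultdict
from statistics import median


def satisfactory_score_summary(records):
    """Summarize TOEIC scores of satisfactory incumbents, separately by job.

    records: iterable of (job, toeic_score, is_satisfactory) tuples, where
    is_satisfactory reflects the company's usual performance standards.
    Returns {job: {"n": count, "lowest": min score, "median": median score}}.
    """
    scores_by_job = defaultdict(list)
    for job, score, is_satisfactory in records:
        if is_satisfactory:
            scores_by_job[job].append(score)

    return {
        job: {"n": len(scores), "lowest": min(scores), "median": median(scores)}
        for job, scores in scores_by_job.items()
    }


# Invented records for two hypothetical job categories.
example_records = [
    ("sales", 300, True),
    ("sales", 400, True),
    ("sales", 200, False),
    ("engineering", 250, True),
]
job_summary = satisfactory_score_summary(example_records)
```

Summaries of this kind describe where satisfactory employees actually score in each job; translating them into a local requirement remains a matter of company judgment, as the text emphasizes.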

In connection with the last point, it is important to recognize 
that the characterization of a particular level on the LPI scale as 
representing the attainment of "minimum working proficiency" or 
"minimum professional proficiency" should not be thought of as a 
general guideline for making workplace decisions about levels of 
proficiency "required" for "successful" performance in particular ESL- 
dependent positions. 

The LPI scale evaluates linguistic behavior in terms of native 
speaker norms (expectations), not specifically in terms of level of 
functional "communicative competence" or "achievement of mutual 
intelligibility," to use Savignon's (1986) terminology. 

(There is a distinction between) adoption of native speaker norms-- 
writing or speaking like a native speaker . . . --and the achieve- 
ment of mutual intelligibility--communicating with native speakers 
(Savignon, 1986: p. 21); . . . (It is important to determine) the 
extent to which deviations of various kinds from native speaker 
norms interfere with mutual intelligibility. Psycholinguistic 
studies provide ample evidence that utterances may be interpretable 
without being 'natural,' i.e., native-like, and that semantically 
deviant utterances are more likely to be misinterpreted than are 
grammatically deviant utterances (pp. 22-23, emphasis added in all 
instances). 

The findings of this study (see especially Figure 10 and related 
discussion, above) suggest that only a small proportion of TOEIC exam- 
inees (with quite atypical English-language backgrounds) are likely to 
be able to write or speak like a native speaker of English--that is, 
to attain "native speaker levels" on the LPI scale. Acceptance of the 
distinction made by Savignon implies pragmatic realism, not a "lower- 
ing of standards." 

In essence, LPI-scaled proficiency levels, like TOEIC scores, 
need to be validated against criteria of ability to accomplish work- 
place assignments that are contingent upon demonstrated ability to 
establish and maintain on-line communicative interaction at a level of 
"mutual intelligibility" that is sufficient for accomplishing the 
(business-related, or other) purposes of the interaction. 

The extent to which individuals below LPI Level 2 (or with TOEIC- 
LC scores below 300) are able to meet the communicative requirements 
of positions involving different levels of interaction with native- 
English speakers is an important empirical question. 




Section V: GENERAL CONCLUDING OBSERVATIONS 



The Language Proficiency Interview procedure has very strong face 
validity as a measure of general ability to use English in face-to- 
face conversation. The linguistic demands imposed by participation in 
the interview situation are in many ways very similar to the linguis- 
tic demands associated with exchanging meaning on-the-job, in situa- 
tions calling for communicative interaction in English. It thus con- 
stitutes a relevant general criterion for establishing the represen- 
tational value of the TOEIC--that is, an expectancy-set based on 
knowledge of test-criterion relationships as to how well an individual 
is likely to be able to use English. 

It is evident that interpretive dividends have been realized in 
the TOEIC testing context by the use of a regression-based criterion 
referencing model in which performance in Language Proficiency Inter- 
views was used as a general context-independent criterion. This was 
expected, a priori, because scores on the LPI criterion have direct 
representational value, and the regression model, by definition, can 
be expected to indicate the extent to which TOEIC scores share the 
criterion measure's representational value. 

Knowledge of TOEIC/LPI relationships represents a clear inter- 
pretive advance because it permits test users to make statistically 
valid inferences from employees' TOEIC scores about their levels of 
developed oral English proficiency (as illustrated in Figure 9, above, 
for example). It also provides a better-informed perspective regarding 
the level and range of oral English proficiency that academically 
trained ESL users/learners in the TOEIC testing context can be 
expected to exhibit (as illustrated in Figure 10, above). 

These interpretive dividends--that obviously will be shared by 
TOEIC users and others interested in second-language assessment--have 
accrued from the TOEIC program's long-term investment in the devel- 
opment and maintenance of a strong direct assessment program to 
complement its major program of norm-referenced testing. The TOEIC 
experience in providing comprehensive assessment services in Japan and 
elsewhere in the world indicates clearly that it is feasible for a 
large-scale ESL proficiency testing program to develop and maintain a 
strong operational capability for the direct assessment of oral 
English proficiency. This is achieved by offering in strategic 
locations the type of education, training, and periodic "recali- 
bration" needed to facilitate the development of a cadre of program- 
related, resident ESL professionals highly skilled in the use of the 
LPI procedure. 

The TOEIC direct assessment program has contributed novel evi- 
dence regarding the probable level and range of functional ability to 
use English in face-to-face conversation for a potentially very large 
population of ESL users/learners. This evidence clarifies the ef- 
fective range of developed oral English proficiency being assessed by 
the TOEIC. 

Finally, the results of this study attest to the elemental clari- 
ty of Carroll's (1967) insight that the interpretive power inherent in 
behaviorally scaled direct assessments could be harnessed--by empiri- 
cal linkage rules established in samples from defined populations--to 
psychometrically more efficient norm-referenced measures of language 
macroskills, and thus be extended to the populations involved. 



NOTES TO TEXT 
Section I 

1. The terms "English as a second language" (ESL) and "English as a 
foreign language" (EFL) are used interchangeably in this report, in a 
generic sense, to indicate that the study is concerned with acquired 
proficiency in English in samples of nonnative speakers--individuals 
for whom English is a foreign or second language--regardless of the 
purpose for which they acquired, are using, or expect to use English 
(e.g., whether they use it on a casual basis for personal and social 
reasons, or daily for travel, work, or study in an English-speaking 
environment). The focal population of test takers is composed of 
highly educated, adult ESL users/learners whose patterns of ESL 
acquisition include a core of academic exposure to the study of 
English as a foreign language. 

2. For example, although the TOEFL (ETS, 1985a) is very widely used to 
screen ESL applicants for admission to U.S. colleges and universities, 
and the users are advised to make followup studies designed to link 
TOEFL score levels to ESL-communication-related performance criteria, 
no examples of such institutional studies were found in a comprehens- 
ive summary of research involving the TOEFL between 1963 and 1982 
(Hale, Stansfield, & Duran, 1984). See Ingram (1985: 237-239) for a 
commentary on the problem of conducting research designed to establish 
the functional implications of scores on indirect, norm-referenced 
tests. 

3. To supplement a search of the literature, the writer queried 
Professor Carroll, by telephone, as to whether he knew of any subse- 
quent study of this kind. Professor Carroll said that he knew of 
none, and attributed the apparent lack of replication to the costs and 
the administrative and logistical difficulties involved in obtaining 
the direct assessments. 

Section II 

4. Problems of this nature are inherent in validation research 
involving context-specific performance criteria--that is, they are not 
peculiar to studies involving context-specific criteria of ability to 
use a target language to accomplish defined language-essential tasks. 
In assessing the predictive validity of norm-referenced college 
admission tests, for example, studies relating test scores to first- 
year grade point average (GPA)--a "context-specific" academic 